Follow this link to see other types of publications on the topic: Dataset selection.

Theses on the topic "Dataset selection"

Cite a source in APA, MLA, Chicago, Harvard, and many other styles

Choose the source type:

See the top 27 dissertations (degree or doctoral theses) for your research on the topic "Dataset selection".

Next to each source in the reference list there is an "Add to bibliography" button. Press it and we will automatically generate a bibliographic citation of the selected work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication as a .pdf file and read its abstract (summary) online, when it is present in the metadata.

Browse theses from many different fields of study and compile a correct bibliography.

1

Sousa, Massáine Bandeira e. "Improving accuracy of genomic prediction in maize single-crosses through different kernels and reducing the marker dataset". Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/11/11137/tde-07032018-163203/.

Full text
Abstract (summary):
In plant breeding, genomic prediction (GP) can be an efficient tool to increase the accuracy of selecting genotypes, mainly under multi-environment trials. This approach has the advantage of increasing genetic gains for complex traits and reducing costs. However, strategies are needed to increase the accuracy and reduce the bias of genomic estimated breeding values. In this context, the objectives were: i) to compare two strategies for obtaining marker subsets based on marker effects, regarding their impact on the prediction accuracy of genomic selection; and ii) to compare the accuracy of four GP methods including genotype × environment (G×E) interaction and two kernels (GBLUP and Gaussian). We used a rice diversity panel (RICE) and two maize datasets (HEL and USP), evaluated for grain yield and plant height. Overall, the prediction accuracy and relative efficiency of genomic selection increased when marker subsets were used, which has the potential to support building fixed arrays and to reduce genotyping costs. Furthermore, using the Gaussian kernel and including the G×E effect increased the accuracy of the genomic prediction models.
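As a rough illustration of the two kernels compared in this abstract, the sketch below builds a GBLUP-style linear kernel and a Gaussian kernel from a centred marker matrix; the function names and the median-distance bandwidth heuristic are assumptions for illustration, not the thesis's own code.

```python
# Minimal sketch, assuming X is a centred marker matrix of shape (n_genotypes, n_markers).
import numpy as np

def gblup_kernel(X):
    """Linear (GBLUP-style) genomic relationship kernel."""
    p = X.shape[1]
    return X @ X.T / p

def gaussian_kernel(X, bandwidth=None):
    """Gaussian kernel built from pairwise squared Euclidean distances."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d2 = np.maximum(d2, 0.0)
    if bandwidth is None:
        bandwidth = np.median(d2[d2 > 0])   # common heuristic choice, an assumption here
    return np.exp(-d2 / bandwidth)
```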
APA, Harvard, Vancouver, ISO, and other styles
2

Awwad, Tarek. "Context-aware worker selection for efficient quality control in crowdsourcing". Thesis, Lyon, 2018. http://www.theses.fr/2018LYSEI099/document.

Full text
Abstract (summary):
Crowdsourcing has proved its ability to address large-scale data collection tasks at a low cost and in a short time. However, due to the dependence on unknown workers, the quality of the crowdsourcing process is questionable and must be controlled. Indeed, maintaining the efficiency of crowdsourcing requires the time and cost overhead related to this quality control to stay low. Current quality control techniques suffer from high time and budget overheads and from their dependency on prior knowledge about individual workers. In this thesis, we address these limitations by proposing the CAWS (Context-Aware Worker Selection) method, which operates in two phases: in an offline phase, the correlations between the worker declarative profiles and the task types are learned; then, in an online phase, the learned profile models are used to select the most reliable online workers for the incoming tasks depending on their types. Using declarative profiles helps eliminate any probing process, which reduces the time and the budget while maintaining the crowdsourcing quality. In order to evaluate CAWS, we introduce an information-rich dataset called CrowdED (Crowdsourcing Evaluation Dataset). The generation of CrowdED relies on a constrained sampling approach that allows producing a dataset which respects the requester's budget and type constraints. Through its generality and richness, CrowdED also helps in plugging the benchmarking gap present in the crowdsourcing community. Using CrowdED, we evaluate the performance of CAWS in terms of quality, time and budget gain. Results show that automatic grouping is able to achieve a learning quality similar to job-based grouping, and that CAWS is able to outperform state-of-the-art profile-based worker selection when it comes to quality, especially when strong budget and time constraints exist. Finally, we propose CREX (CReate Enrich eXtend), which provides the tools to select and sample input tasks and to automatically generate custom crowdsourcing campaign sites in order to extend and enrich CrowdED.
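The two-phase idea (offline clustering of historical tasks, online selection of workers whose profiles resemble the cluster's reference profile) can be sketched roughly as follows; the clustering method, the cosine similarity, and all names are illustrative assumptions rather than the actual CAWS implementation.

```python
# Hedged sketch, assuming one representative "good worker" profile per historical task.
import numpy as np
from sklearn.cluster import KMeans

def offline_phase(task_vectors, good_worker_profiles, n_clusters=10):
    """Cluster historical tasks and attach a reference worker profile to each cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(task_vectors)
    cluster_profiles = np.zeros((n_clusters, good_worker_profiles.shape[1]))
    for c in range(n_clusters):
        mask = km.labels_ == c
        # reference profile = mean profile of workers who answered this cluster's tasks well
        cluster_profiles[c] = good_worker_profiles[mask].mean(axis=0)
    return km, cluster_profiles

def online_phase(task_vector, online_worker_profiles, km, cluster_profiles, k=5):
    """Pick the k online workers closest to the profile of the nearest task cluster."""
    c = km.predict(task_vector.reshape(1, -1))[0]
    ref = cluster_profiles[c]
    sims = online_worker_profiles @ ref / (
        np.linalg.norm(online_worker_profiles, axis=1) * np.linalg.norm(ref) + 1e-12)
    return np.argsort(-sims)[:k]
```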
APA, Harvard, Vancouver, ISO, and other styles
3

Lingle, Jeremy Andrew. "Evaluating the Performance of Propensity Scores to Address Selection Bias in a Multilevel Context: A Monte Carlo Simulation Study and Application Using a National Dataset". Digital Archive @ GSU, 2009. http://digitalarchive.gsu.edu/eps_diss/56.

Full text
Abstract (summary):
When researchers are unable to randomly assign students to treatment conditions, selection bias is introduced into the estimates of treatment effects. Random assignment to treatment conditions, which has historically been the scientific benchmark for causal inference, is often impossible or unethical to implement in educational systems. For example, researchers cannot deny services to those who stand to gain from participation in an academic program. Additionally, students select into a particular treatment group through processes that are impossible to control, such as those that result in a child dropping out of high school or attending a resource-starved school. Propensity score methods provide valuable tools for removing the selection bias from quasi-experimental research designs and observational studies through modeling the treatment assignment mechanism. The utility of propensity scores has been validated for the purposes of removing selection bias when the observations are assumed to be independent; however, the ability of propensity scores to remove selection bias in a multilevel context, in which group membership plays a role in the treatment assignment, is relatively unknown. A central purpose of the current study was to begin filling in the gaps in knowledge regarding the performance of propensity scores for removing selection bias, as defined by covariate balance, in multilevel settings using a Monte Carlo simulation study. The performance of propensity scores was also examined using a large-scale national dataset. Results from this study provide support for the conclusion that multilevel characteristics of a sample have a bearing upon the performance of propensity scores to balance covariates between treatment and control groups. Findings suggest that propensity score estimation models should take into account the cluster-level effects when working with multilevel data; however, the numbers of treatment and control group individuals within each cluster must be sufficiently large to allow estimation of those effects. Propensity scores that take into account the cluster-level effects can have the added benefit of balancing covariates within each cluster as well as across the sample as a whole.
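A minimal sketch of the kind of procedure discussed in this abstract, assuming a pandas data frame with a cluster identifier (e.g. a school): propensity scores are estimated with cluster indicators included, and covariate balance is then summarised with a standardized mean difference. This is an illustration, not the study's simulation code.

```python
# Illustrative sketch; column names and the weighting scheme are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_with_clusters(df, covariates, cluster_col, treat_col):
    """Estimate propensity scores with cluster indicator variables included."""
    X = pd.get_dummies(df[covariates + [cluster_col]], columns=[cluster_col], drop_first=True)
    model = LogisticRegression(max_iter=1000).fit(X, df[treat_col])
    return model.predict_proba(X)[:, 1]

def standardized_mean_difference(x, treat, weights=None):
    """Covariate balance diagnostic: (weighted) mean difference in pooled SD units."""
    x, treat = np.asarray(x, dtype=float), np.asarray(treat)
    w = np.ones(len(x)) if weights is None else np.asarray(weights, dtype=float)
    t, c = treat == 1, treat == 0
    m1 = np.average(x[t], weights=w[t])
    m0 = np.average(x[c], weights=w[c])
    s = np.sqrt((x[t].var(ddof=1) + x[c].var(ddof=1)) / 2)
    return (m1 - m0) / s
```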
APA, Harvard, Vancouver, ISO, and other styles
4

Zoghi, Zeinab. "Ensemble Classifier Design and Performance Evaluation for Intrusion Detection Using UNSW-NB15 Dataset". University of Toledo / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1596756673292254.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Silva, Wilbor Poletti. "Archaeomagnetic field intensity evolution during the last two millennia". Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/14/14132/tde-19092018-135335/.

Full text
Abstract (summary):
Temporal variations of Earth's magnetic field provide a great range of geophysical information about the dynamics of different layers of the Earth. Since it is a planetary field, regional and global aspects can be explored, depending on the timescale of the variations. In this thesis, the geomagnetic field variations over the last two millennia were investigated. For that, some improvements were made to the methods used to recover the ancient magnetic field intensity from archaeological material, new data were acquired, and a critical assessment of the global archaeomagnetic database was performed. Two methodological advances are reported, comprising: i) the correction, for the microwave method, of the cooling rate effect, which is associated with the difference between the cooling time during the manufacture of the material and that of the heating steps during the archaeointensity experiment; and ii) a test for thermoremanent anisotropy correction based on the arithmetic mean of six orthogonal samples. The temporal variation of the magnetic intensity for South America was investigated from nine new data points, three from ruins of the Guaraní Jesuit Missions and six from archaeological sites associated with jerked beef farms, both located in Rio Grande do Sul, Brazil, with ages covering the last 400 years. These data, combined with the regional archaeointensity database, demonstrate that the influence of significant non-dipole components in South America started at ~1800 CE. Finally, from a reassessment of the global archaeointensity database, a new interpretation is proposed for the geomagnetic axial dipole evolution, in which this component has fallen steadily since ~700 CE, associated with the breaking of the symmetry of the advective sources operating in the outer core.
APA, Harvard, Vancouver, ISO, and other styles
6

Hrabina, Martin. "VÝVOJ ALGORITMŮ PRO ROZPOZNÁVÁNÍ VÝSTŘELŮ". Doctoral thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-409087.

Full text
Abstract (summary):
This thesis deals with gunshot recognition and the associated problems. First, the task is introduced and broken down into smaller steps. An overview of sound databases, significant publications, events, and the current state of the art is then given, together with a survey of possible applications of gunshot detection. The second part compares features using various metrics, together with a comparison of their recognition performance. A comparison of recognition algorithms follows, and new features usable for recognition are introduced. The thesis culminates in the design of a two-stage gunshot recognition system that monitors its surroundings in real time. The conclusion summarizes the achieved results and outlines further work.
APA, Harvard, Vancouver, ISO, and other styles
7

Khan, Md Jafar Ahmed. "Robust linear model selection for high-dimensional datasets". Thesis, University of British Columbia, 2006. http://hdl.handle.net/2429/31082.

Full text
Abstract (summary):
This study considers the problem of building a linear prediction model when the number of candidate covariates is large and the dataset contains a fraction of outliers and other contaminations that are difficult to visualize and clean. We aim at predicting the future non-outlying cases. Therefore, we need methods that are robust and scalable at the same time. We consider two different strategies for model selection: (a) one-step model building and (b) two-step model building. For one-step model building, we robustify the step-by-step algorithms forward selection (FS) and stepwise (SW), with robust partial F-tests as stopping rules. Our two-step model building procedure consists of sequencing and segmentation. In sequencing, the input variables are sequenced to form a list such that the good predictors are likely to appear in the beginning, and the first m variables of the list form a reduced set for further consideration. For this step we robustify Least Angle Regression (LARS) proposed by Efron, Hastie, Johnstone and Tibshirani (2004). We use bootstrap to stabilize the results obtained by robust LARS, and use "learning curves" to determine the size of the reduced set. The second step (of the two-step model building procedure) - which we call segmentation - carefully examines subsets of the covariates in the reduced set in order to select the final prediction model. For this we propose a computationally suitable robust cross-validation procedure. We also propose a robust bootstrap procedure for segmentation, which is similar to the method proposed by Salibian-Barrera and Zamar (2002) to conduct robust inferences in linear regression. We introduce the idea of "multivariate-Winsorization" which we use for robust data cleaning (for the robustification of LARS). We also propose a new correlation estimate which we call the "adjusted-Winsorized correlation estimate". This estimate is consistent and has bounded influence, and has some advantages over univariate-Winsorized correlation estimate (Huber 1981 and Alqallaf 2003).
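The winsorization idea mentioned in this abstract can be sketched as follows: a simple univariate-Winsorized correlation (the thesis's adjusted-Winsorized estimate refines this further); the cut-off constant and MAD scaling are assumptions for illustration.

```python
# Minimal sketch of a univariate-Winsorized correlation estimate.
import numpy as np

def winsorize(x, c=2.0):
    """Shrink values beyond c robust standard deviations toward the centre."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) * 1.4826   # MAD rescaled to be consistent for normal data
    lo, hi = med - c * mad, med + c * mad
    return np.clip(x, lo, hi)

def winsorized_correlation(x, y, c=2.0):
    """Correlation computed after winsorizing each variable separately."""
    return np.corrcoef(winsorize(x, c), winsorize(y, c))[0, 1]
```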
Science, Faculty of
Statistics, Department of
Graduate
APA, Harvard, Vancouver, ISO, and other styles
8

Mo, Dengyao. "Robust and Efficient Feature Selection for High-Dimensional Datasets". University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1299010108.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Poolsawad, Nongnuch. "Practical approaches to mining of clinical datasets : from frameworks to novel feature selection". Thesis, University of Hull, 2014. http://hydra.hull.ac.uk/resources/hull:8620.

Full text
Abstract (summary):
Research has investigated clinical data that have embedded within them numerous complexities and uncertainties in the form of missing values, class imbalances and high dimensionality. The research in this thesis was motivated by these challenges: to minimise these problems whilst, at the same time, maximising the classification performance of the data and selecting a significant subset of variables. This led to the proposal of a data mining framework and a feature selection method. The proposed framework has a simple algorithmic structure and makes use of a modified form of existing frameworks to address a variety of different data issues; it is called the Handling Clinical Data Framework (HCDF). The assessment of data mining techniques reveals that missing-value imputation and resampling data for class balancing can improve the performance of classification. Next, the proposed feature selection method is introduced; it involves projecting onto the principal components (FS-PPC) and draws on ideas from both feature extraction and feature selection to select a significant subset of features from the data. This method selects features that have high correlation with the principal components by applying symmetrical uncertainty (SU), while irrelevant and redundant features are removed using mutual information (MI). The method provides confidence that the selected subset of features will yield realistic results with less time and effort. FS-PPC is able to retain classification performance and meaningful features while consisting of non-redundant features. The proposed methods have been applied in practice to the analysis of real clinical data and their effectiveness has been assessed. The results show that the proposed methods are able to minimise the clinical data problems whilst, at the same time, maximising the classification performance of the data.
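A small sketch of the symmetrical uncertainty score used in this abstract to relate a discrete feature to a target, assuming both are discrete-valued arrays; the function names are illustrative and this is not the FS-PPC code itself.

```python
# Symmetrical uncertainty SU(X, Y) = 2 * MI(X; Y) / (H(X) + H(Y)), in bits.
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy_bits(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    mi = mutual_info_score(x, y) / np.log(2)   # mutual_info_score returns nats; convert to bits
    hx, hy = entropy_bits(x), entropy_bits(y)
    return 2.0 * mi / (hx + hy) if (hx + hy) > 0 else 0.0
```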
APA, Harvard, Vancouver, ISO, and other styles
10

Kurra, Goutham. "Pattern Recognition in Large Dimensional and Structured Datasets". University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1014322308.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Vege, Sri Harsha. "Ensemble of Feature Selection Techniques for High Dimensional Data". TopSCHOLAR®, 2012. http://digitalcommons.wku.edu/theses/1164.

Full text
Abstract (summary):
Data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships from large amounts of data stored in databases, data warehouses, or other information repositories. Feature selection is an important preprocessing step of data mining that helps increase the predictive performance of a model. The main aim of feature selection is to choose a subset of features with high predictive information and eliminate irrelevant features with little or no predictive information. Using a single feature selection technique may generate local optima. In this thesis we propose an ensemble approach for feature selection, where multiple feature selection techniques are combined to yield more robust and stable results. The ensemble of multiple feature ranking techniques is performed in two steps. The first step involves creating a set of different feature selectors, each providing its sorted order of features, while the second step aggregates the results of all feature ranking techniques. The ensemble method used in our study is frequency count, which is accompanied by the mean to resolve any frequency count collision. Experiments conducted in this work are performed on datasets collected from the Kent Ridge bio-medical data repository. The Lung Cancer dataset and the Lymphoma dataset are selected from the repository to perform experiments. The Lung Cancer dataset consists of 57 attributes and 32 instances and the Lymphoma dataset consists of 4027 attributes and 96 instances. Experiments are performed on the reduced datasets obtained from feature ranking. These datasets are used to build the classification models. Model performance is evaluated in terms of the AUC (Area under the Receiver Operating Characteristic Curve) performance metric. ANOVA tests are also performed on the AUC performance metric. Experimental results suggest that an ensemble of multiple feature selection techniques is more effective than an individual feature selection technique.
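The aggregation step described in this abstract (frequency count, with the mean rank position breaking ties) might look roughly like the sketch below; the cut-off k and the names are assumptions.

```python
# Hedged sketch: each ranker contributes its top-k features; features are ordered by
# how often they appear, with mean rank position used to break ties.
from collections import defaultdict

def aggregate_rankings(rankings, k=50):
    """rankings: list of lists of feature names, each sorted best-first."""
    freq = defaultdict(int)
    positions = defaultdict(list)
    for ranking in rankings:
        for pos, feat in enumerate(ranking[:k]):
            freq[feat] += 1
            positions[feat].append(pos)
    # higher frequency first; lower mean position breaks ties
    return sorted(freq, key=lambda f: (-freq[f], sum(positions[f]) / len(positions[f])))

# Toy usage with three rankers
print(aggregate_rankings([["g1", "g2", "g3"], ["g2", "g1", "g4"], ["g2", "g3", "g1"]]))
```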
APA, Harvard, Vancouver, ISO, and other styles
12

Elsilä, U. (Ulla). "Knowledge discovery method for deriving conditional probabilities from large datasets". Doctoral thesis, University of Oulu, 2007. http://urn.fi/urn:isbn:9789514286698.

Full text
Abstract (summary):
In today's world, enormous amounts of data are being collected every day. Thus, the problems of storing, handling, and utilizing the data are faced constantly. As the human mind itself can no longer interpret such vast datasets, methods for extracting useful and novel information from the data are needed and developed. These methods are collectively called knowledge discovery methods. In this thesis, a novel combination of feature selection and data modeling methods is presented in order to help with this task. This combination includes the methods of basic statistical analysis, linear correlation, the self-organizing map, parallel coordinates, and k-means clustering. The presented method can be used, first, to select the most relevant features from even hundreds of them and, then, to model the complex inter-correlations within the selected ones. The capability to handle hundreds of features opens up the possibility to study more extensive processes instead of just looking at smaller parts of them. The results of a k-nearest-neighbors study show that the presented feature selection procedure is valid and appropriate. A second advantage of the presented method is the possibility to use thousands of samples. Whereas the current rules for selecting appropriate limits for utilizing the methods are theoretically proved only for small sample sizes, especially in the case of linear correlation, this thesis gives guidelines for feature selection with thousands of samples. A third positive aspect is the nature of the results: given that the outcome of the method is a set of conditional probabilities, the derived model is highly unrestrictive and rather easy to interpret. In order to test the presented method in practice, it was applied to study two different cases of steel manufacturing with hot strip rolling. In the first case, the conditional probabilities for different types of retentions were derived and, in the second case, the rolling conditions for the occurrence of wedge were revealed. The results of both of these studies show that steel manufacturing processes are indeed very complex and highly dependent on the various stages of the manufacturing. This was further confirmed by the fact that with studies of k-nearest-neighbors and C4.5 it was impossible to derive useful models concerning the datasets as a whole. It is believed that the reason for this lies in the nature of these two methods, meaning that they are unable to grasp such manifold inter-correlations in the data. On the contrary, the presented method of conditional probabilities allowed new knowledge to be gained of the studied processes, which will help to better understand these processes and to enhance them.
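As a toy illustration of deriving conditional probabilities from clustered data (only one piece of the full method described in this abstract), one can cluster the selected features and estimate the probability of an outcome within each cluster; the clustering choice and names are assumptions.

```python
# Illustrative sketch: P(outcome = 1 | cluster = c) for each k-means cluster.
import numpy as np
from sklearn.cluster import KMeans

def conditional_probabilities(X, outcome, n_clusters=6):
    """X: (n_samples, n_selected_features); outcome: 0/1 array (e.g. a retention defect)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    outcome = np.asarray(outcome)
    probs = {}
    for c in range(n_clusters):
        mask = labels == c
        probs[c] = outcome[mask].mean()   # conditional probability of the outcome in cluster c
    return probs
```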
APA, Harvard, Vancouver, ISO, and other styles
13

Wan, Cen. "Novel hierarchical feature selection methods for classification and their application to datasets of ageing-related genes". Thesis, University of Kent, 2015. https://kar.kent.ac.uk/54761/.

Full text
Abstract (summary):
Hierarchical Feature Selection (HFS) is an under-explored subarea of data mining/machine learning. Unlike conventional (flat) feature selection algorithms, HFS algorithms work by exploiting hierarchical (generalisation-specialisation) relationships between features, in order to try to improve the predictive accuracy of classifiers. The basic idea is to remove hierarchical redundancy between features, where the presence of a feature in an instance implies the presence of all ancestors of that feature in that instance. By using an HFS algorithm to select a feature subset where the hierarchical redundancy among features is eliminated or reduced, and then giving only the selected feature subset to a classification algorithm, it is possible to improve the predictive accuracy of classification algorithms. In terms of applications, this thesis focuses on datasets of ageing-related genes. This type of dataset is an interesting type of application for data mining methods due to the technical difficulty and ethical issues associated with doing ageing experiments with humans and the strategic importance of research on the biology of ageing - since age is the greatest risk factor for a number of diseases, but is still a not well understood biological process. This thesis offers contributions mainly to the area of data mining/machine learning, but also to bioinformatics and the biology of ageing, as discussed next. The first and main type of contribution consists of four novel HFS algorithms, namely: select Hierarchical Information Preserving (HIP) features, select Most Relevant (MR) features, the hybrid HIP–MR algorithm, and the Hierarchy-based Redundancy Eliminated Tree Augmented Naive Bayes (HRE–TAN) algorithm. These algorithms perform lazy learning-based feature selection - i.e. they postpone the learning process to the moment when testing instances are observed and select a specific feature subset for each testing instance. HIP, MR and HIP–MR select features in a data pre-processing phase, before running a classification algorithm, and they select features that can be used as input by any lazy classification algorithm. In contrast, HRE–TAN is a feature selection process embedded in the construction of a lazy TAN classifier. The second type of contribution, relevant to the areas of data mining and bioinformatics, consists of two novel algorithms that exploit the pre-defined structure of the Gene Ontology (GO) and the results of a flat or hierarchical feature selection algorithm to create the network topology of a Bayesian Network Augmented Naive Bayes (BAN) classifier. These are called GO–BAN algorithms. The proposed HFS algorithms were in general evaluated in combination with lazy versions of three Bayesian network classifiers, namely Naïve Bayes, TAN and GO–BAN - except that HRE–TAN works only with TAN. The experiments involved comparing the predictive accuracy obtained by these classifiers using the features selected by the proposed HFS algorithms with the predictive accuracy obtained by these classifiers using the features selected by flat feature selection algorithms, as well as the accuracy obtained by the classifiers using all original features (without feature selection) as a baseline. 
The experiments used a number of ageing-related datasets, where the instances being classified are genes, the predictive features are GO terms describing hierarchical gene functions, and the classes to be predicted indicate whether a gene has a pro-longevity or anti-longevity effect in the lifespan of a model organism (yeast, worm, fly or mouse). In general, with the exception of the hybrid HIP–MR which did not obtain good results, the other three proposed HFS algorithms (HIP, MR, HRE–TAN) improved the predictive performance of the baseline Bayesian network classifiers - i.e. in general the classifiers obtained higher accuracies when using only the features selected by the HFS algorithm than when using all original features. Overall, the most successful of the four HFS algorithms was HIP, which outperformed all other (hierarchical or flat) feature selection algorithms when used in combination with each of the Naive Bayes, TAN and GO–BAN classifiers. The difference of predictive accuracy between HIP and the other feature selection algorithms was almost always statistically significant - except that the difference of accuracy between HIP and MR was not significant with TAN. Comparing different combinations of a HFS algorithm and a Bayesian network classifier, HIP+NB and HIP+GO–BAN were both the best combination, with the same average rank across all datasets. They obtained predictive accuracies statistically significantly higher than the accuracies obtained by all other combinations of HFS algorithm and classifier. The third type of contribution of this thesis is a contribution to the biology of ageing. More precisely, the proposed HIP and MR algorithms were used to produce rankings of GO terms in decreasing order of their usefulness for predicting the pro-longevity or anti-longevity effect of a gene on a model organism; and the top GO terms in these rankings were interpreted with the help of a biologist expert on ageing, leading to potentially relevant patterns about the biology of ageing.
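A toy sketch of the hierarchical-redundancy idea underlying HIP, as described above: for one instance, a GO term that is present is dropped whenever a more specific present term already implies it. This is a simplified reading for illustration, not the exact HIP procedure.

```python
# Lazy, per-instance removal of hierarchically redundant (implied) terms.
def remove_hierarchical_redundancy(present_terms, ancestors):
    """present_terms: set of GO terms present in the instance.
    ancestors: dict mapping a term to the set of all its ancestor terms."""
    implied = set()
    for term in present_terms:
        implied |= ancestors.get(term, set())
    # keep only the most specific terms (those not implied by another present term)
    return present_terms - implied

# Toy hierarchy: GO:C is-a GO:B is-a GO:A
anc = {"GO:B": {"GO:A"}, "GO:C": {"GO:A", "GO:B"}}
print(remove_hierarchical_redundancy({"GO:A", "GO:B", "GO:C"}, anc))  # {'GO:C'}
```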
APA, Harvard, Vancouver, ISO, and other styles
14

Kruczyk, Marcin. "Rule-Based Approaches for Large Biological Datasets Analysis : A Suite of Tools and Methods". Doctoral thesis, Uppsala, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-206137.

Full text
Abstract (summary):
This thesis is about new and improved computational methods to analyze complex biological data produced by advanced biotechnologies. Such data is not only very large but is also characterized by very high numbers of features. Addressing these needs, we developed a set of methods and tools that are suitable for analyzing large sets of data, including next generation sequencing data, and built transparent models that may be interpreted by researchers not necessarily expert in computing. We focused on brain related diseases. The first aim of the thesis was to employ the meta-server approach to finding peaks in ChIP-seq data. Taking existing peak finders we created an algorithm that produces consensus results better than any single peak finder. The second aim was to use supervised machine learning to identify features that are significant in predictive diagnosis of Alzheimer disease in patients with mild cognitive impairment. This experience led to the development of a better feature selection method for rough sets, a machine learning method. The third aim was to deepen the understanding of the role that the STAT3 transcription factor plays in gliomas. Interestingly, we found that STAT3, in addition to being an activator, is also a repressor in certain rat and human glioma models. This was achieved by analyzing STAT3 binding sites in combination with epigenetic marks. STAT3 regulation was determined using expression data of untreated cells and cells after JAK2/STAT3 inhibition. The four papers constituting the thesis are preceded by an exposition of the biological, biotechnological and computational background that provides foundations for the papers. The overall results of this thesis testify to the mutually beneficial relationship between Bioinformatics and modern Life Sciences and Computer Science.
APA, Harvard, Vancouver, ISO, and other styles
15

Luo, Silang. "Data mining of many-attribute data : investigating the interaction between feature selection strategy and statistical features of datasets". Thesis, Heriot-Watt University, 2009. http://hdl.handle.net/10399/2276.

Full text
Abstract (summary):
In many datasets, there is a very large number of attributes (e.g. many thousands). Such datasets can cause many problems for machine learning methods. Various feature selection (FS) strategies have been developed to address these problems. The idea of an FS strategy is to reduce the number of features in a dataset (e.g. from many thousands to a few hundred) so that machine learning and/or statistical analysis can be done much more quickly and effectively. Obviously, FS strategies attempt to select the features that are most important, considering the machine learning task to be done. The work presented in this dissertation concerns the comparison between several popular feature selection strategies, and, in particular, investigation of the interaction between feature selection strategy and simple statistical features of the dataset. The basic hypothesis, not investigated before, is that the correct choice of FS strategy for a particular dataset should be based on a simple (at least) statistical analysis of the dataset. First, we examined the performance of several strategies on a selection of datasets. Strategies examined were: four widely-used FS strategies (Correlation, Relief F, Evolutionary Algorithm, no-feature-selection), several feature bias (FB) strategies (in which the machine learning method considers all features, but makes use of bias values suggested by the FB strategy), and also combinations of FS and FB strategies. The results showed us that FB methods displayed strong capability on some datasets and that combined strategies were also often successful. Examining these results, we noted that patterns of performance were not immediately understandable. This led to the above hypothesis (one of the main contributions of the thesis) that statistical features of the dataset are an important consideration when choosing an FS strategy. We then investigated this hypothesis with several further experiments. Analysis of the results revealed that a simple statistical feature of a dataset, which can be easily pre-calculated, has a clear relationship with the performance of certain FS methods, and a similar relationship with differences in performance between certain pairs of FS strategies. In particular, Correlation-based FS (CFS) is a very widely-used FS technique based on the basic hypothesis that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. By analysing the outcome of several FS strategies on different artificial datasets, the experiments suggest that CFS is never the best choice for poorly correlated data. Finally, considering several methods, we suggest tentative guidelines for choosing an FS strategy based on simply calculated measures of the dataset.
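For reference, the standard merit heuristic behind correlation-based feature selection (CFS) scores a subset from its average feature-class and feature-feature correlations; the small helper below simply evaluates that formula and is included only as an illustration of the hypothesis quoted above.

```python
# CFS merit of a k-feature subset from its average correlations.
import math

def cfs_merit(mean_feature_class_corr, mean_feature_feature_corr, k):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    return (k * mean_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr)

print(cfs_merit(0.4, 0.2, 10))   # redundancy among the features lowers the merit
```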
APA, Harvard, Vancouver, ISO, and other styles
16

Fraideinberze, Antonio Canabrava. "Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations". Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-17112017-154451/.

Full text
Abstract (summary):
Given a very large dataset of moderate-to-high dimensionality, how can we mine useful patterns from it? In such cases, dimensionality reduction is essential to overcome the well-known curse of dimensionality. Although there exist algorithms to reduce the dimensionality of Big Data, unfortunately, they all fail to identify/eliminate non-linear correlations that may occur between the attributes. This MSc work tackles the problem by exploring concepts of Fractal Theory and massive parallel processing to present Curl-Remover, a novel dimensionality reduction technique for very large datasets. Our contributions are: (a) Curl-Remover eliminates linear and non-linear attribute correlations as well as irrelevant attributes; (b) it is unsupervised and suits analytical tasks in general, not only classification; (c) it presents linear scale-up on both the data size and the number of machines used; (d) it does not require the user to guess the number of attributes to be removed; and (e) it preserves the attributes' semantics by performing feature selection, not feature extraction. We executed experiments on synthetic and real data spanning up to 1.1 billion points, and report that our proposed Curl-Remover outperformed two PCA-based algorithms from the state of the art, being on average up to 8% more accurate.
APA, Harvard, Vancouver, ISO, and other styles
17

Granato, Italo Stefanine Correia. "snpReady and BGGE: R packages to prepare datasets and perform genome-enabled predictions". Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/11/11137/tde-21062018-134207/.

Full text
Abstract (summary):
The use of molecular markers allows an increase in the efficiency of selection as well as a better understanding of genetic resources in breeding programs. However, with the increase in the number of markers, the data must be processed before they are ready to use. Also, to explore genotype × environment (GE) interaction in the context of genomic prediction, some covariance matrices need to be set up before the prediction step. Thus, aiming to facilitate the introduction of genomic practices into breeding program pipelines, we developed two R packages. The first, snpReady, prepares datasets for genomic studies. This package offers three functions to reach this objective: organizing the data and applying quality control, building the genomic relationship matrix, and summarizing population genetics statistics. Furthermore, we present a new imputation method for missing markers. The second is the BGGE package, built to generate kernels for some GE genomic models and perform predictions. It consists of two functions (getK and BGGE). The former is helpful to create kernels for the GE genomic models, and the latter performs genomic predictions with some features for GE kernels that decrease the computational time. The features covered in the two packages present a fast and straightforward option to help the introduction and usage of genomic analysis in the breeding program pipeline.
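snpReady and BGGE are R packages; as a language-agnostic illustration of one of the steps mentioned above, a genomic relationship matrix in the style of VanRaden's first method can be computed from a 0/1/2 marker matrix as sketched below. This is an assumption-laden sketch, not the packages' own code.

```python
# Generic genomic relationship matrix (VanRaden-style) from a 0/1/2 marker matrix.
import numpy as np

def genomic_relationship_matrix(M):
    """M: (n_individuals, n_markers) matrix coded as 0/1/2 copies of an allele."""
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0                 # allele frequencies per marker
    Z = M - 2.0 * p                          # centre each marker by 2p
    denom = 2.0 * np.sum(p * (1.0 - p))      # scaling so G is analogous to the pedigree matrix
    return Z @ Z.T / denom
```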
APA, Harvard, Vancouver, ISO, and other styles
18

Brown, Ryan Charles. "Development of Ground-Level Hyperspectral Image Datasets and Analysis Tools, and their use towards a Feature Selection based Sensor Design Method for Material Classification". Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/84944.

Full text
Abstract (summary):
Visual sensing in robotics, especially in the context of autonomous vehicles, has advanced quickly and many important contributions have been made in the areas of target classification. Typical to these studies is the use of the Red-Green-Blue (RGB) camera. Separately, in the field of remote sensing, the hyperspectral camera has been used to perform classification tasks on natural and man-made objects from typically aerial or satellite platforms. Hyperspectral data is characterized by a very fine spectral resolution, resulting in a significant increase in the ability to identify materials in the image. This hardware has not been studied in the context of autonomy as the sensors are large, expensive, and have non-trivial image capture times. This work presents three novel contributions: a Labeled Hyperspectral Image Dataset (LHID) of ground-level, outdoor objects based on typical scenes that a vehicle or pedestrian may encounter, an open-source hyperspectral interface software package (HSImage), and a feature selection based sensor design algorithm for object detection sensors (DLSD). These three contributions are novel and useful in the fields of hyperspectral data analysis, visual sensor design, and hyperspectral machine learning. The hyperspectral dataset and hyperspectral interface software were used in the design and testing of the sensor design algorithm. The LHID is shown to be useful for machine learning tasks through experimentation and provides a unique data source for hyperspectral machine learning. HSImage is shown to be useful for manipulating, labeling and interacting with hyperspectral data, and allows wavelength and classification based data retrieval, storage of labeling information and ambient light data. DLSD is shown to be useful for creating wavelength bands for a sensor design that increase the accuracy of classifiers trained on data from the LHID. DLSD shows accuracy near that of the full spectrum hyperspectral data, with a reduction in features on the order of 100 times. It compared favorably to other state-of-the-art wavelength feature selection techniques and exceeded the accuracy of an RGB sensor by 10%.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
19

Duncan, Andrew Paul. "The analysis and application of artificial neural networks for early warning systems in hydrology and the environment". Thesis, University of Exeter, 2014. http://hdl.handle.net/10871/17569.

Full text
Abstract (summary):
Artificial Neural Networks (ANNs) have been comprehensively researched, both from a computer scientific perspective and with regard to their use for predictive modelling in a wide variety of applications including hydrology and the environment. Yet their adoption for live, real-time systems remains on the whole sporadic and experimental. A plausible hypothesis is that this may be at least in part due to their treatment heretofore as “black boxes” that implicitly contain something that is unknown, or even unknowable. It is understandable that many of those responsible for delivering Early Warning Systems (EWS) might not wish to take the risk of implementing solutions perceived as containing unknown elements, despite the computational advantages that ANNs offer. This thesis therefore builds on existing efforts to open the box and develop tools and techniques that visualise, analyse and use ANN weights and biases especially from the viewpoint of neural pathways from inputs to outputs of feedforward networks. In so doing, it aims to demonstrate novel approaches to self-improving predictive model construction for both regression and classification problems. This includes Neural Pathway Strength Feature Selection (NPSFS), which uses ensembles of ANNs trained on differing subsets of data and analysis of the learnt weights to infer degrees of relevance of the input features and so build simplified models with reduced input feature sets. Case studies are carried out for prediction of flooding at multiple nodes in urban drainage networks located in three urban catchments in the UK, which demonstrate rapid, accurate prediction of flooding both for regression and classification. Predictive skill is shown to reduce beyond the time of concentration of each sewer node, when actual rainfall is used as input to the models. Further case studies model and predict statutory bacteria count exceedances for bathing water quality compliance at 5 beaches in Southwest England. An illustrative case study using a forest fires dataset from the UCI machine learning repository is also included. Results from these model ensembles generally exhibit improved performance, when compared with single ANN models. Also ensembles with reduced input feature sets, using NPSFS, demonstrate as good or improved performance when compared with the full feature set models. Conclusions are drawn about a new set of tools and techniques, including NPSFS and visualisation techniques for inspection of ANN weights, the adoption of which it is hoped may lead to improved confidence in the use of ANN for live real-time EWS applications.
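One simple way to score input relevance from the learnt weights of a single-hidden-layer feedforward network, loosely in the spirit of the neural-pathway idea described above, is sketched below; NPSFS itself uses ensembles and its own pathway-strength definition, so this is only an illustrative approximation with assumed names.

```python
# Hedged sketch: relevance of each input = total absolute pathway strength through hidden units.
import numpy as np

def input_relevance(W_in, W_out):
    """W_in: (n_inputs, n_hidden) weights; W_out: (n_hidden, n_outputs) weights."""
    strength = np.abs(W_in) @ np.abs(W_out)      # (n_inputs, n_outputs) pathway strengths
    scores = strength.sum(axis=1)                # aggregate over outputs
    return scores / scores.sum()                 # normalise so feature scores sum to one
```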
APA, Harvard, Vancouver, ISO, and other styles
20

Neumann, Ursula [author], Dmitrij [academic supervisor] Frischmann, Dominik [reviewer] Heider and Dmitrij [reviewer] Frischmann. "Stability and Accuracy Analysis of a Feature Selection Ensemble for Binary Classification in Biomedical Datasets / Ursula Neumann ; Gutachter: Dominik Heider, Dmitrij Frischmann ; Betreuer: Dmitrij Frischmann". München : Universitätsbibliothek der TU München, 2018. http://d-nb.info/1154931641/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Yang, Zong-ming, and 楊宗明. "Applying Clonal Selection Theory in Dataset Clustering". Thesis, 2013. http://ndltd.ncl.edu.tw/handle/05851847141349430024.

Full text
Abstract (summary):
Master's thesis
National Kaohsiung First University of Science and Technology
Institute of Information Management
Academic year 101 (2012–2013)
This thesis presents a clonal selection algorithm to solve the data clustering problem. A clonal selection algorithm mimics clonal selection theory, which is composed of three mechanisms: clonal selection, clonal expansion, and affinity maturation via somatic hypermutation. The important feature of the theory is that when a cell is selected it proliferates: it is cloned in proportion to its affinity rank, and the clones are hypermutated in proportion to their affinity weights. The resulting clonal set then competes with the existing antibody population for membership in the next generation. Finally, this study combines clonal selection theory with Particle Swarm Optimization to cluster UCI public datasets, using the hybrid structure to prevent premature convergence during the computation. Experimental results show that the proposed hybrid system, with its high diversity, improves the performance of data clustering.
APA, Harvard, Vancouver, ISO, and other styles
22

Lutu, P. E. N. (Patricia Elizabeth Nalwoga). "Dataset selection for aggregate model implementation in predictive data mining". Thesis, 2010. http://hdl.handle.net/2263/29486.

Full text
Abstract (summary):
Data mining has become a commonly used method for the analysis of organisational data, for purposes of summarizing data in useful ways and identifying non-trivial patterns and relationships in the data. Given the large volumes of data that are collected by business, government, non-government and scientific research organizations, a major challenge for data mining researchers and practitioners is how to select relevant data for analysis in sufficient quantities, in order to meet the objectives of a data mining task. This thesis addresses the problem of dataset selection for predictive data mining. Dataset selection was studied in the context of aggregate modeling for classification. The central argument of this thesis is that, for predictive data mining, it is possible to systematically select many dataset samples and employ different approaches (different from current practice) to feature selection, training dataset selection, and model construction. When a large amount of information in a large dataset is utilised in the modeling process, the resulting models will have a high level of predictive performance and should be more reliable. Aggregate classification models, also known as ensemble classifiers, have been shown to provide a high level of predictive accuracy on small datasets. Such models are known to achieve a reduction in the bias and variance components of the prediction error of a model. The research for this thesis was aimed at the design of aggregate models and the selection of training datasets from large amounts of available data. The objectives for the model design and dataset selection were to reduce the bias and variance components of the prediction error for the aggregate models. Design science research was adopted as the paradigm for the research. Large datasets obtained from the UCI KDD Archive were used in the experiments. Two classification algorithms, See5 for classification tree modeling and K-Nearest Neighbour, were used in the experiments. The two methods of aggregate modeling that were studied are One-Vs-All (OVA) and positive-Vs-negative (pVn) modeling. While OVA is an existing method that has been used for small datasets, pVn is a new method of aggregate modeling proposed in this thesis. Methods for feature selection from large datasets, and methods for training dataset selection from large datasets, for OVA and pVn aggregate modeling, were studied. The experiments on feature selection revealed that the use of many samples, robust measures of correlation, and validation procedures results in the reliable selection of relevant features for classification. A new algorithm for feature subset search, based on the decision rule-based approach to heuristic search, was designed and the performance of this algorithm was compared to two existing algorithms for feature subset search. The experimental results revealed that the new algorithm makes better decisions for feature subset search. The information provided by a confusion matrix was used as a basis for the design of OVA and pVn base models, which are combined into one aggregate model. A new construct called a confusion graph was used in conjunction with new algorithms for the design of pVn base models. A new algorithm for combining base model predictions and resolving conflicting predictions was designed and implemented. Experiments to study the performance of the OVA and pVn aggregate models revealed that the aggregate models provide a high level of predictive accuracy compared to single models.
Finally, theoretical models to depict the relationships between the factors that influence feature selection and training dataset selection for aggregate models are proposed, based on the experimental results.
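A minimal sketch of One-Vs-All (OVA) aggregate modelling over samples drawn from a large dataset is shown below; the base learner, sample sizes, and combination rule are illustrative assumptions, not the thesis's See5/k-NN setup.

```python
# Hedged sketch: one binary base model per class, trained on that class's positives
# plus a sample of negatives; predictions are combined by the highest positive score.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ova(X, y, negatives_per_model=10000, random_state=0):
    rng = np.random.default_rng(random_state)
    models = {}
    for cls in np.unique(y):
        pos = np.flatnonzero(y == cls)
        neg = np.flatnonzero(y != cls)
        neg = rng.choice(neg, size=min(negatives_per_model, len(neg)), replace=False)
        idx = np.concatenate([pos, neg])
        models[cls] = DecisionTreeClassifier().fit(X[idx], (y[idx] == cls).astype(int))
    return models

def predict_ova(models, X):
    classes = sorted(models)
    scores = np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
    return np.asarray(classes)[scores.argmax(axis=1)]
```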
Thesis (PhD)--University of Pretoria, 2010.
Computer Science
unrestricted
APA, Harvard, Vancouver, ISO, and other styles
23

Chen, Yen-Tze, and 陳彥澤. "Comparison of single feature selection and fusion feature selection for medical dataset". Thesis, 2018. http://ndltd.ncl.edu.tw/handle/v55952.

Full text
Abstract (summary):
Master's thesis
Yuan Ze University
Department of Computer Science and Engineering
Academic year 106 (2017–2018)
In an era of rapidly developing information technology, tens of thousands of documents are generated every day and large volumes of information accumulate everywhere. Businesses therefore value user data and use data analysis software to predict future consumer preferences, placing top-selling merchandise in prominent positions in the mall and pairing poorly selling items with promotion programs to increase sales. Big data refers to data so large that database systems cannot store, compute, and process it, and turn the analysis into interpretable information, within a reasonable time; in recent years, experts in the field have worked on handling such data and completing the analysis within a limited time. The research area of this thesis is feature selection, whose main purpose is to remove redundant or repetitive attributes from a dataset, thereby reducing the complexity and time of the analysis. The research topic is the comparison of single feature selection and fusion feature selection, examining whether fusion feature selection yields better prediction model accuracy than single feature selection. For practical application, medical datasets from the UCI Machine Learning Repository and the KDD 2008 medical dataset are selected as the experimental data, and classification techniques from medical statistics are compared with classification techniques from the information field.
Gli stili APA, Harvard, Vancouver, ISO e altri
24

Fu, JuiHsi, e 傅瑞曦. "Sample Selection on Labeling Imbalanced Datasets and Learning Efficient Classifiers". Thesis, 2013. http://ndltd.ncl.edu.tw/handle/60300828589932265351.

Testo completo
Abstract (sommario):
Doctoral thesis
National Chung Cheng University
Institute of Computer Science and Information Engineering
101
When building a classification system, two practical issues must be considered carefully. First, it is difficult to collect a complete dataset in a short period of time. Second, labeling the collected data by human effort is expensive. In this thesis, we study further research issues in active learning, which aims to label informative samples, and in incremental learning, which builds the classifier from sequentially collected datasets. We concentrate on designing approaches for labeling imbalanced datasets and for learning classifiers efficiently. Our main idea is to select informative samples, either for labeling data or for adjusting classifiers. Our active learning approaches query unlabeled samples without being affected by the class imbalance problem: specific labeled samples are selected to decide whether an unlabeled sample should be queried. The objective of our incremental learning approaches is to select informative samples, such as samples that are misclassified or classified with low confidence, so that the classifier can be adjusted efficiently. We also consider the situation in which the sequentially collected dataset is still insufficient; in this case we select labeled samples that are relevant for generating specific classifiers for the target sample. In our experiments, the approaches are evaluated on synthetic datasets and on real-world datasets from the UCI repository and the campus of National Chung Cheng University. The experimental results and theoretical analysis show that our approaches effectively handle the practical issues of labeling data and adjusting classifiers.
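The idea of selecting informative samples for labeling can be illustrated with a generic uncertainty-sampling sketch, in which a classifier trained on a small labeled pool proposes the unlabeled samples it is least confident about; the query rule and synthetic imbalanced dataset below are assumptions, and the sketch is simpler than the thesis's imbalance-aware approaches.

```python
# Hedged sketch of informative-sample selection via uncertainty sampling.
# This is a generic illustration, not the thesis's imbalance-aware method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

labeled = np.arange(50)                 # small initial labeled pool
unlabeled = np.arange(50, len(X))       # remaining samples, labels assumed hidden

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
proba = clf.predict_proba(X[unlabeled])

# Margin-based uncertainty: a smaller gap between the two class probabilities
# means the classifier is less certain, so the sample is more informative.
margin = np.abs(proba[:, 0] - proba[:, 1])
query = unlabeled[np.argsort(margin)[:10]]
print("Indices of the 10 samples proposed for labeling:", query)
```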
Gli stili APA, Harvard, Vancouver, ISO e altri
25

YEH, CHENG-HUA, e 葉政華. "A Study on Gene Selection and Classification with Microarray Datasets". Thesis, 2003. http://ndltd.ncl.edu.tw/handle/26638596245660080051.

Testo completo
Abstract (sommario):
Master's thesis
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
91
This thesis discusses two essential issues in microarray data analysis: gene selection for tumor classification and learning gene functional classes. The first issue concerns how to select the genes that are informative for a specific classification problem from the large number of genes in a microarray dataset. This thesis proposes two clustering-based methods, and experimental results reveal that the proposed methods are able to identify a set of informative genes when applied to a challenging tumor dataset. The second issue studied in this thesis is aimed at identifying correlations between clusters of co-expressed genes in a microarray dataset and co-regulated cell activities. This thesis investigates the effects of exploiting supervised learning algorithms to deal with this problem. Experimental results show that the novel RBF network based learning algorithm recently proposed by our research team and the support vector machine (SVM) deliver far better results than the other well-known approaches included in this study. Nevertheless, the experimental results also show that the supervised learning based approach can be applied successfully to only a few classes of co-regulated genes. In response to this observation, this thesis proposes two methods to improve the recall rates of the supervised learning based approach.
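A generic clustering-based gene selection step followed by SVM classification might look like the sketch below, where genes are grouped by expression profile and one representative gene per cluster is retained; the synthetic expression matrix and the highest-variance representative rule are illustrative assumptions and do not reproduce the thesis's two proposed methods.

```python
# Hedged sketch: clustering-based gene (feature) selection followed by an SVM.
# The synthetic expression matrix and representative-gene rule are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 500))          # 80 samples x 500 genes (synthetic)
y = rng.integers(0, 2, size=80)         # binary tumor labels (synthetic)

# Cluster genes by their expression profiles across samples (transpose X).
k = 20
gene_clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X.T)

# Keep one representative gene per cluster: the one with the highest variance.
representatives = [
    np.where(gene_clusters == c)[0][np.argmax(X[:, gene_clusters == c].var(axis=0))]
    for c in range(k)
]

scores = cross_val_score(SVC(kernel="linear"), X[:, representatives], y, cv=5)
print("Mean CV accuracy with", k, "representative genes:", scores.mean())
```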
Gli stili APA, Harvard, Vancouver, ISO e altri
26

Syu, Jen-Hui, e 徐仁徽. "Target Genes Selection for Human Colon Cancer Datasets Based on Data Mining Algorithm". Thesis, 2013. http://ndltd.ncl.edu.tw/handle/74213282671290750613.

Testo completo
Abstract (sommario):
Master's thesis
National Chung Hsing University
Institute of Genomics and Bioinformatics
101
In 2013, the Ministry of Health and Welfare announced the ten leading causes of death in Taiwan. Malignancies have ranked first among the ten leading causes of death for 31 years running, and colon, rectum, and anal cancers together accounted for 5,131 deaths. Colon cancer is the most commonly diagnosed cancer in Taiwan and the third most common type of cancer in both sexes. When recognized at an early stage, the disease is often treatable, so early detection and management of colon cancer risk can lengthen life expectancy. The risk of developing colon cancer increases with advancing age, and many other risk factors, such as dietary habits, gender, family history, and hereditary factors, increase the chance of developing the disease. In recent years, the average age of patients has been decreasing. This study developed a novel target gene, prediction analysis, and medical data mining system, which uses target genes to build a database and integrates artificial neural network and decision tree analysis techniques to further study the target genes of colon cancer patients. The aim is to establish a diagnostic system that reduces the misdiagnosis rate and health care costs. Our experimental results show that the accuracy rates of the back-propagation neural network and the random tree classifier are 62% and 95.6%, respectively. Finally, this thesis builds a visual web interface system so that medical staff can observe gene expression more intuitively and quickly during the diagnostic process. This study aims to support gene-related problem solving and medical decision making, motivated by efforts to improve human health.
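The reported comparison of a back-propagation neural network with a random tree classifier can be sketched generically as below, assuming scikit-learn's MLPClassifier and RandomForestClassifier as stand-ins and synthetic gene-expression-style data; the parameters are not those of the study's actual system.

```python
# Hedged sketch: comparing a back-propagation neural network with a tree-based
# classifier on gene-expression-style data. The stand-in models and synthetic
# data are assumptions; this is not the study's actual system.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

models = {
    "back-propagation NN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(30,), max_iter=2000, random_state=0)),
    "random forest (stand-in for random tree)": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```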
Gli stili APA, Harvard, Vancouver, ISO e altri
27

HENDRICK e 林高民. "Using Biological Feature Selection Approach on Large-Scale Integrated Microarray Datasets for Colorectal Cancer Prediction". Thesis, 2017. http://ndltd.ncl.edu.tw/handle/phj8te.

Testo completo
Abstract (sommario):
Master's thesis
Tzu Chi University
Master's Program, Department of Medical Informatics
105
Colorectal cancer is one of the most common cancers, with the fourth highest mortality rate in the world. Microarrays can be used to gather information from tissue samples about gene expression differences that are useful for diagnosing colorectal cancer at the molecular level. A highly accurate prediction method is needed to deal with the high incidence of colorectal cancer, but such methods still face challenges ranging from selecting appropriate features and deciding how many features to select, to choosing the classification algorithm. Large-scale studies, comparisons between studies, and the selection of informative data are therefore crucial before such methods can be applied clinically. Here, we propose a systematic pipeline for colorectal cancer prediction, from data collection, preprocessing, and merging through feature selection to classification, that combines large-scale integrated microarray datasets with biological feature selection to obtain a highly accurate prediction method suitable for clinical use. We integrated (i) 31 curated colorectal microarray datasets with 2443 cancer and 361 normal samples, (ii) variance stabilization normalization as the preprocessing method, (iii) empirical Bayes batch effect removal with two factors (batch and phenotype information), (iv) biological feature selection using gene set enrichment analysis and gene ontology enrichment analysis based on functional biological knowledge of the genes, and (v) a support vector machine as the classification algorithm. As a result, our method provides more reliable predictions while retaining high accuracy (around 98%), and it identifies correlated genes in the inflammatory response that play an important role in the development of the adenomatous polyps that can lead to colorectal cancer. In addition, our method is suitable for clinical use because it provides a large sample size and a broad comparison across different microarray studies.
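A minimal sketch of the classification stage of such a pipeline, after normalization, batch effect removal, and feature selection, is given below; the per-batch mean-centering shown is only a crude stand-in for empirical Bayes (ComBat) correction, and the synthetic data, batch labels, and SVM parameters are assumptions rather than the study's configuration.

```python
# Hedged sketch of the classification stage for merged microarray datasets.
# Per-batch mean-centering is a crude stand-in for empirical Bayes (ComBat)
# batch correction; the synthetic data and batch labels are assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_genes, n_batches = 300, 200, 5
X = rng.normal(size=(n_samples, n_genes))
y = rng.integers(0, 2, size=n_samples)            # cancer vs. normal (synthetic)
batch = rng.integers(0, n_batches, size=n_samples)

# Crude batch adjustment: center each gene within each batch.
X_adj = X.copy()
for b in range(n_batches):
    mask = batch == b
    X_adj[mask] -= X_adj[mask].mean(axis=0)

scores = cross_val_score(SVC(kernel="linear"), X_adj, y,
                         cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print("Mean stratified CV accuracy:", scores.mean())
```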
Gli stili APA, Harvard, Vancouver, ISO e altri