Journal articles on the topic 'Dataset selection'

Consult the top 50 journal articles for your research on the topic 'Dataset selection.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Kumar, H. M. Keerthi, and B. S. Harish. "A New Feature Selection Method for Sentiment Analysis in Short Text." Journal of Intelligent Systems 29, no. 1 (December 4, 2018): 1122–34. http://dx.doi.org/10.1515/jisys-2018-0171.

Abstract:
In the recent internet era, micro-blogging sites produce enormous amounts of short textual information, which appears in the form of opinions or sentiments of users. Sentiment analysis is a challenging task in short text, due to the use of informal language, misspellings, and shortened forms of words, which leads to high dimensionality and sparsity. In order to deal with these challenges, this paper proposes a novel, simple, and yet effective feature selection method that selects frequently distributed features related to each class. In this paper, the feature selection method is based on class-wise information, to identify the relevant features related to each class. We evaluate the proposed feature selection method by comparing it with existing feature selection methods like chi-square (χ2), entropy, information gain, and mutual information. The performances are evaluated using classification accuracy obtained from support vector machine, K nearest neighbors, and random forest classifiers on two publicly available datasets, viz., the Stanford Twitter dataset and the Ravikiran Janardhana dataset. In order to demonstrate the effectiveness of the proposed feature selection method, we conducted extensive experimentation by selecting different feature sets. The proposed feature selection method outperforms the existing feature selection methods in terms of classification accuracy on the Stanford Twitter dataset. Similarly, the proposed method performs comparably to the other feature selection methods in terms of classification accuracy on most of the feature subsets of the Ravikiran Janardhana dataset.
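
The class-wise selection idea in this abstract lends itself to a compact prototype. Below is a minimal sketch, assuming a bag-of-words representation and a simple per-class frequency score; the function and its scoring are illustrative stand-ins, not the authors' exact formulation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def classwise_top_features(docs, labels, per_class=100):
    """Score each term by how often it occurs within each class and
    keep the union of the top-scoring terms per class."""
    vec = CountVectorizer()
    X = vec.fit_transform(docs)                  # documents x terms (counts)
    terms = np.array(vec.get_feature_names_out())
    labels = np.asarray(labels)
    selected = set()
    for c in np.unique(labels):
        counts = np.asarray(X[labels == c].sum(axis=0)).ravel()
        selected.update(terms[np.argsort(counts)[::-1][:per_class]])
    return sorted(selected)

# toy usage on four short "tweets"
docs = ["good movie love it", "bad plot awful acting",
        "love the acting", "awful boring movie"]
print(classwise_top_features(docs, [1, 0, 1, 0], per_class=2))
```
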
2

Endalie, Demeke, and Getamesay Haile. "Hybrid Feature Selection for Amharic News Document Classification." Mathematical Problems in Engineering 2021 (March 11, 2021): 1–8. http://dx.doi.org/10.1155/2021/5516262.

Abstract:
Today, the amount of Amharic digital documents has grown rapidly. Because of this, automatic text classification is extremely important. Proper selection of features has a crucial role in the accuracy of classification and in computational time. When the initial feature set is considerably large, it is important to pick the right features. In this paper, we present a hybrid feature selection method, called IGCHIDF, which consists of the information gain (IG), chi-square (CHI), and document frequency (DF) feature selection methods. We evaluate the proposed feature selection method on two datasets: dataset 1, containing 9 news categories, and dataset 2, containing 13 news categories. Our experimental results showed that the proposed method performs better than other methods on both datasets 1 and 2. On dataset 2, the IGCHIDF method's classification accuracy is up to 3.96% higher than the IG method, up to 11.16% higher than CHI, and 7.3% higher than DF.
3

Peter, Timm J., and Oliver Nelles. "Fast and simple dataset selection for machine learning." at - Automatisierungstechnik 67, no. 10 (October 25, 2019): 833–42. http://dx.doi.org/10.1515/auto-2019-0010.

Abstract:
The task of data reduction is discussed and a novel selection approach which allows control of the optimal point distribution of the selected data subset is proposed. The proposed approach utilizes the estimation of probability density functions (pdfs). Due to its structure, the new method is capable of selecting a subset either by approximating the pdf of the original dataset or by approximating an arbitrary, desired target pdf. The new strategy evaluates the estimated pdfs solely on the selected data points, resulting in a simple and efficient algorithm with low computational and memory demand. The performance of the new approach is investigated for two different scenarios. For representative subset selection of a dataset, the new approach is compared to a recently proposed, more complex method and shows comparable results. For the demonstration of the capability of matching a target pdf, a uniform distribution is chosen as an example. Here the new method is compared to strategies for space-filling design of experiments and shows convincing results.
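
To make the criterion concrete, here is a deliberately naive greedy sketch of pdf-matching subset selection in one dimension, assuming Gaussian kernel density estimates and exhaustive greedy search; the paper's algorithm is far more efficient, so treat this only as an illustration of the objective.

```python
import numpy as np
from scipy.stats import gaussian_kde

def select_subset(data, n_select, target_pdf=None):
    """Greedily grow a subset whose kernel density estimate, evaluated
    only on the selected points, tracks a target pdf (defaulting to the
    pdf of the full dataset). Assumes 1-D data with distinct values."""
    data = np.asarray(data, dtype=float)
    full_kde = gaussian_kde(data)
    target = target_pdf if target_pdf is not None else full_kde
    selected = [int(np.argmax(full_kde(data)))]   # seed at the densest point
    remaining = set(range(len(data))) - set(selected)
    while len(selected) < n_select:
        best, best_err = None, np.inf
        for i in remaining:
            trial = data[selected + [i]]
            # squared mismatch between subset KDE and target, on the subset only
            err = np.mean((gaussian_kde(trial)(trial) - target(trial)) ** 2)
            if err < best_err:
                best, best_err = i, err
        selected.append(best)
        remaining.remove(best)
    return data[selected]
```

Passing, e.g., `scipy.stats.uniform(0, 1).pdf` as `target_pdf` reproduces the target-matching scenario from the abstract.
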
4

Perez-Alvarez, Susana, Guadalupe Gómez, and Christian Brander. "FARMS: A New Algorithm for Variable Selection." BioMed Research International 2015 (2015): 1–11. http://dx.doi.org/10.1155/2015/319797.

Abstract:
Large datasets including an extensive number of covariates are generated these days in many different situations, for instance, in detailed genetic studies of outbred human populations or in complex analyses of immune responses to different infections. Aiming at informing clinical interventions or vaccine design, methods for variable selection identifying those variables with the optimal prediction performance for a specific outcome are crucial. However, testing all potential subsets of variables is not feasible and alternatives to existing methods are needed. Here, we describe a new method to handle such complex datasets, referred to as FARMS, that combines forward and all-subsets regression for model selection. We apply FARMS to a host genetic and immunological dataset of over 800 individuals from Lima (Peru) and Durban (South Africa) who were HIV infected and tested for antiviral immune responses. This dataset includes more than 500 explanatory variables: around 400 variables with information on HIV immune reactivity and around 100 individual genetic characteristics. We have implemented FARMS in the R statistical language and we showed that FARMS is fast and outcompetes other comparable commonly used approaches, thus providing a new tool for the thorough analysis of complex datasets without the need for massive computational infrastructure.
5

Dash, Ch Sanjeev Kumar, Ajit Kumar Behera, Sarat Chandra Nayak, Satchidananda Dehuri, and Sung-Bae Cho. "An Integrated CRO and FLANN Based Classifier for a Non-Imputed and Inconsistent Dataset." International Journal on Artificial Intelligence Tools 28, no. 03 (May 2019): 1950013. http://dx.doi.org/10.1142/s0218213019500131.

Abstract:
This paper presents an integrated approach by considering chemical reaction optimization (CRO) and functional link artificial neural networks (FLANNs) for building a classifier from the dataset with missing value, inconsistent records, and noisy instances. Here, imputation is carried out based on the known value of two nearest neighbors to address dataset plagued with missing values. The probabilistic approach is used to remove the inconsistency from either of the datasets like original or imputed. The resulting dataset is then given as an input to boosted instance selection approach for selection of relevant instances to reduce the size of the dataset without loss of generality and compromising classification accuracy. Finally, the transformed dataset (i.e., from non-imputed and inconsistent dataset to imputed and consistent dataset) is used for developing a classifier based on CRO trained FLANN. The method is evaluated extensively through a few bench-mark datasets obtained from University of California, Irvine (UCI) repository. The experimental results confirm that our preprocessing tasks along with integrated approach can be a promising alternative tool for mitigating missing value, inconsistent records, and noisy instances.
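
The two-nearest-neighbor imputation step described above can be sketched as follows; the Euclidean metric over observed features and the mean aggregation are assumptions, since the abstract only fixes the use of the two nearest neighbors' known values.

```python
import numpy as np

def impute_two_nn(X):
    """Fill each missing entry with the mean of that feature in the two
    nearest complete rows (nearness measured on the observed features)."""
    X = np.asarray(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]
    out = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        d = np.linalg.norm(complete[:, obs] - row[obs], axis=1)
        nn = complete[np.argsort(d)[:2]]          # the two nearest complete rows
        out[i, miss] = nn[:, miss].mean(axis=0)
    return out

X = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.2],
              [0.9, 2.1, 2.9],
              [5.0, 6.0, 7.0]])
print(impute_two_nn(X))   # the NaN becomes mean(2.0, 2.1) = 2.05
```

scikit-learn's `KNNImputer(n_neighbors=2)` from `sklearn.impute` gives comparable behavior out of the box.
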
6

Jamjoom, Mona. "The pertinent single-attribute-based classifier for small datasets classification." International Journal of Electrical and Computer Engineering (IJECE) 10, no. 3 (June 1, 2020): 3227. http://dx.doi.org/10.11591/ijece.v10i3.pp3227-3234.

Abstract:
Classifying a dataset using machine learning algorithms can be a big challenge when the target is a small dataset. The OneR classifier can be used for such cases due to its simplicity and efficiency. In this paper, we reveal the power of a single attribute by introducing the pertinent single-attribute-based heterogeneity-ratio classifier (SAB-HR), which uses a pertinent attribute to classify small datasets. SAB-HR uses a feature selection method based on the Heterogeneity-Ratio (H-Ratio) measure to identify the most homogeneous attribute among the attributes in the set. Our empirical results on 12 benchmark datasets from the UCI machine learning repository show that the SAB-HR classifier significantly outperforms the classical OneR classifier for small datasets. In addition, using the H-Ratio as a feature selection criterion for selecting the single attribute was more effectual than other traditional criteria, such as Information Gain (IG) and Gain Ratio (GR).
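
For context, the classical OneR baseline that SAB-HR is measured against fits in a few lines. The H-Ratio criterion itself is not reproduced here; ordinary training error stands in as the attribute-scoring assumption.

```python
import numpy as np
from collections import Counter

def one_attribute_rule(X, y, col):
    """OneR-style rule for a single attribute: map each value the
    attribute takes to the majority class observed with that value."""
    X, y = np.asarray(X), np.asarray(y)
    return {v: Counter(y[X[:, col] == v]).most_common(1)[0][0]
            for v in np.unique(X[:, col])}

def rule_error(X, y, col):
    rule = one_attribute_rule(X, y, col)
    pred = np.array([rule[v] for v in np.asarray(X)[:, col]])
    return np.mean(pred != np.asarray(y))

def best_attribute(X, y):
    """Pick the attribute whose single-attribute rule errs least;
    SAB-HR replaces this error criterion with its H-Ratio measure."""
    return min(range(np.asarray(X).shape[1]),
               key=lambda c: rule_error(X, y, c))
```

A production OneR would also need a default class for attribute values unseen at training time; that detail is omitted here.
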
7

Dif, Nassima, and Zakaria Elberrichi. "An Enhanced Recursive Firefly Algorithm for Informative Gene Selection." International Journal of Swarm Intelligence Research 10, no. 2 (April 2019): 21–33. http://dx.doi.org/10.4018/ijsir.2019040102.

Abstract:
Feature selection is the process of identifying well-performing combinations of significant features among many possibilities. This preprocessing improves the classification accuracy and facilitates the learning task. For this optimization problem, the authors have used a metaheuristic approach. Their main objective is to propose an enhanced version of the firefly algorithm as a wrapper approach by adding a recursive behavior to improve the search for the optimal solution. They applied an SVM classifier to investigate the proposed method. For their experiments, they used benchmark microarray datasets. The results show that the new enhanced recursive FA (RFA) outperforms the standard version with a reduction of dimensionality for all the datasets. As an example, for the leukemia microarray dataset, they achieve a perfect performance score of 100% with only 18 informative selected genes among the 7,129 of the original dataset. The RFA was competitive compared to other state-of-the-art approaches and achieved the best results for the CNS, Ovarian cancer, MLL, Prostate, Leukemia_4c, and Lymphoma datasets.
8

Omara, Hicham, Mohamed Lazaar, and Youness Tabii. "Effect of Feature Selection on Gene Expression Datasets Classification Accuracy." International Journal of Electrical and Computer Engineering (IJECE) 8, no. 5 (October 1, 2018): 3194. http://dx.doi.org/10.11591/ijece.v8i5.pp3194-3203.

Abstract:
Feature selection attracts researchers who deal with machine learning and data mining. It consists of selecting the variables that have the greatest impact on the dataset classification and discarding the rest. This dimensionality reduction allows classifiers to be fast and more accurate. This paper examines the effect of feature selection on the accuracy of widely used classifiers in the literature. These classifiers are compared on three real datasets which are pre-processed with feature selection methods. More than 9% improvement in classification accuracy is observed, and k-means appears to be the classifier most sensitive to feature selection.
9

Rocco, C. B. V., R. L. Silva, O. C. Junior, and M. Rudek. "Seleção de software baseada em AHP para criação de dataset sintético 3D" [AHP-based software selection for the creation of a 3D synthetic dataset]. Revista SODEBRAS 15, no. 176 (August 2020): 50–55. http://dx.doi.org/10.29367/issn.1809-3957.15.2020.176.50.

10

Devaraj, Senthilkumar, and S. Paulraj. "An Efficient Feature Subset Selection Algorithm for Classification of Multidimensional Dataset." Scientific World Journal 2015 (2015): 1–9. http://dx.doi.org/10.1155/2015/821798.

Abstract:
Multidimensional medical data classification has recently received increased attention from researchers working on machine learning and data mining. In a multidimensional dataset (MDD), each instance is associated with multiple class values. Due to its complex nature, feature selection and classifier building from an MDD are typically more expensive or time-consuming. Therefore, we need a robust feature selection technique for selecting the optimum single subset of the features of the MDD for further analysis or to design a classifier. In this paper, an efficient feature selection algorithm is proposed for the classification of MDD. The proposed multidimensional feature subset selection (MFSS) algorithm yields a unique feature subset for further analysis or to build a classifier, and there is a computational advantage on MDD compared with the existing feature selection algorithms. The proposed work is applied to benchmark multidimensional datasets. The number of features was reduced to between 3% and 30% of the original by using the proposed MFSS. In conclusion, the study results show that MFSS is an efficient feature selection algorithm without affecting the classification accuracy even for the reduced number of features. Also, the proposed MFSS algorithm is suitable for both problem transformation and algorithm adaptation and has great potential in those applications generating multidimensional datasets.
11

Hu, Yue, Ge Peng, Zehua Wang, Yanrong Cui, and Hang Qin. "Partition Selection for Large-Scale Data Management Using KNN Join Processing." Mathematical Problems in Engineering 2020 (September 8, 2020): 1–14. http://dx.doi.org/10.1155/2020/7898230.

Abstract:
With the avalanche of data in large datasets, the k nearest neighbors (KNN) algorithm becomes a particularly expensive operation for both classification and regression predictive problems. To predict the values of new data points, it calculates the feature similarity between each object in the test dataset and each object in the training dataset. However, due to the expensive computational cost, a single computer cannot cope with large-scale datasets. In this paper, we propose an adaptive vKNN algorithm, which builds on the Voronoi diagram under the MapReduce parallel framework and makes full use of the advantages of parallel computing in processing large-scale data. In the process of partition selection, we design a new predictive strategy for sample points to find the optimal relevant partition. We can then effectively filter out irrelevant data, reduce the KNN join computation, and improve the operational efficiency. Finally, we conduct extensive experiments on a cluster using 54-dimensional datasets. The experimental results show that our proposed method is effective and scalable while ensuring accuracy.
12

Ismi, Dewi Pramudi, Shireen Panchoo, and Murinto Murinto. "K-means clustering based filter feature selection on high dimensional data." International Journal of Advances in Intelligent Informatics 2, no. 1 (March 31, 2016): 38. http://dx.doi.org/10.26555/ijain.v2i1.54.

Abstract:
With hundreds or thousands of features in high-dimensional data, the computational workload is challenging. In the classification process, features which do not contribute significantly to the prediction of classes add to the computational workload. Therefore, the aim of this paper is to use feature selection to decrease the computational load by reducing the size of high-dimensional data. Subsets of features which represent all features are selected. The process is hence two-fold: discarding irrelevant data and choosing one feature to represent a number of redundant features. There have been many studies regarding feature selection, for example backward feature selection and forward feature selection. In this study, a k-means clustering based feature selection is proposed. It is assumed that redundant features are located in the same cluster, whereas irrelevant features do not belong to any cluster. In this research, two different high-dimensional datasets are used: 1) the Human Activity Recognition Using Smartphones (HAR) dataset, containing 7352 data points each of 561 features, and 2) the National Classification of Economic Activities dataset, which contains 1080 data points each of 857 features. Both datasets provide class label information for each data point. Our experiment shows that k-means clustering based feature selection can be performed to produce a subset of features. The latter returns more than 80% accuracy in the classification results.
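
A minimal sketch of the central idea, clustering the features rather than the samples and keeping one representative per cluster, might look like this; choosing the feature nearest each centroid is an assumption, as the abstract does not fix the representative rule.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_feature_selection(X, n_clusters):
    """Cluster the *features* (columns of X) with k-means and keep, from
    each cluster, the single feature closest to the cluster centroid,
    treating co-clustered features as mutually redundant."""
    F = X.T                                   # one row per feature
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(F)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(F[members] - km.cluster_centers_[c], axis=1)
        keep.append(members[np.argmin(d)])
    return sorted(keep)                       # indices of retained features
```

The paper's additional step of discarding irrelevant features that fit no cluster is not modeled here.
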
13

Hairani, Hairani, and Muhammad Innuddin. "Kombinasi Metode Correlated Naive Bayes dan Metode Seleksi Fitur Wrapper untuk Klasifikasi Data Kesehatan." Jurnal Teknik Elektro 11, no. 2 (April 27, 2020): 50–55. http://dx.doi.org/10.15294/jte.v11i2.23693.

Abstract:
Health data often contain many irrelevant features that can reduce the performance of classification methods. Two health datasets with many attributes are the Pima Indian Diabetes and Thyroid datasets. Diabetes is a deadly disease caused by increasing blood sugar due to the body's inability to produce enough insulin, and its complications can lead to heart attacks and strokes. The purpose of this research is to combine the Correlated Naïve Bayes method with Wrapper-based feature selection for the classification of health data. This research consists of several stages, namely: (1) collection of the Pima Indian Diabetes and Thyroid datasets from the UCI Machine Learning Repository, (2) data pre-processing such as transformation, scaling, and Wrapper-based feature selection, (3) classification using the Correlated Naive Bayes and Naive Bayes methods, and (4) a performance test based on accuracy using 10-fold cross validation. Based on the results, the combination of the Correlated Naive Bayes method and Wrapper-based feature selection achieves the best accuracy on both datasets used. For the Pima Indian Diabetes dataset, the accuracy is 71.4%, and for the Thyroid dataset it is 79.38%. Thus, the combination of the Correlated Naïve Bayes method and Wrapper-based feature selection results in better accuracy than without feature selection, with an increase of 4.1% for the Pima Indian Diabetes dataset and 0.48% for the Thyroid dataset.
14

Paramita, Adi Suryaputra. "Improving K-NN Internet Traffic Classification Using Clustering and Principle Component Analysis." Bulletin of Electrical Engineering and Informatics 6, no. 2 (June 1, 2017): 159–65. http://dx.doi.org/10.11591/eei.v6i2.608.

Abstract:
K-Nearest Neighbour (K-NN) is one of the most popular classification algorithms, and in this research K-NN is used to classify internet traffic. K-NN is appropriate for huge amounts of data and yields accurate classification, but it has a disadvantage in its computation process because it calculates the distance to all existing data in the dataset. Clustering is one solution to overcome this K-NN weakness; the clustering process should be done before the K-NN classification process, and it does not need high computing time to group data which have the same characteristics. Fuzzy C-Means is the clustering algorithm used in this research. The Fuzzy C-Means algorithm does not need the number of clusters to be determined first; the clusters in this algorithm are formed naturally based on the datasets that are entered. However, Fuzzy C-Means has a weakness: the clustering results obtained are frequently not the same even though the input dataset is the same, because the initial dataset given to Fuzzy C-Means is less than optimal. To optimize the initial dataset, a feature selection algorithm is needed. Feature selection is a method to produce an optimal initial dataset for Fuzzy C-Means. The feature selection algorithm in this research is Principal Component Analysis (PCA). PCA can remove non-significant attributes or features to create an optimal dataset and can improve performance for clustering and classification algorithms. The result of this research is that the combined method of classification, clustering, and feature selection of the internet traffic dataset successfully modeled an internet traffic classification method with higher accuracy and faster performance.
15

Prasetiyowati, Maria Irmina, Nur Ulfa Maulidevi, and Kridanto Surendro. "Feature selection to increase the random forest method performance on high dimensional data." International Journal of Advances in Intelligent Informatics 6, no. 3 (November 6, 2020): 303. http://dx.doi.org/10.26555/ijain.v6i3.471.

Abstract:
Random Forest is a supervised classification method based on bagging (bootstrap aggregating; Breiman) and random selection of features. Because features are randomly assigned in Random Forest, the selected features are not necessarily informative, so it is useful to perform feature selection before applying Random Forest. The purpose of this feature selection is to select an optimal subset of features that contains valuable information, in the hope of accelerating the performance of the Random Forest method, mainly for the execution of high-dimensional datasets such as the Parkinson, CNAE-9, and Urban Land Cover datasets. The feature selection is done using the Correlation-Based Feature Selection method with the BestFirst search method. Tests were carried out 30 times using 10-fold cross validation and dividing the dataset into 70% training and 30% testing. The experiments on the Parkinson dataset obtained times 0.27 and 0.28 seconds faster than the Random Forest method without feature selection. Likewise, the trials on the Urban Land Cover dataset were 0.04 and 0.03 seconds faster, while for the CNAE-9 dataset the time differences were 2.23 and 2.81 seconds faster than the Random Forest method without feature selection. These experiments showed that Random Forest runs faster when feature selection is performed first. Likewise, the accuracy increased in the first two experiments, while only the CNAE-9 experiment obtained a lower accuracy. The benefit of this research is showing that first performing feature selection with the Correlation-Based Feature Selection method can increase the speed and accuracy of the Random Forest method on high-dimensional data.
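
Weka-style Correlation-Based Feature Selection with BestFirst search has no direct scikit-learn equivalent, so the hedged sketch below substitutes a univariate filter to reproduce the experiment's shape: time and score a Random Forest with and without a prior feature-selection step, on synthetic stand-in data.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# synthetic stand-in for a high-dimensional dataset
X, y = make_classification(n_samples=500, n_features=800, n_informative=30,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

selector = SelectKBest(f_classif, k=50)       # stand-in for CFS + BestFirst
Xtr_f = selector.fit_transform(Xtr, ytr)
Xte_f = selector.transform(Xte)

for name, (a, b) in {"all 800 features": (Xtr, Xte),
                     "top-50 filtered": (Xtr_f, Xte_f)}.items():
    rf = RandomForestClassifier(random_state=0)
    t0 = time.perf_counter()
    rf.fit(a, ytr)
    print(f"{name}: train {time.perf_counter() - t0:.2f}s, "
          f"accuracy {rf.score(b, yte):.3f}")
```
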
16

Akinyelu, Andronicus A., and Aderemi O. Adewumi. "Improved Instance Selection Methods for Support Vector Machine Speed Optimization." Security and Communication Networks 2017 (2017): 1–11. http://dx.doi.org/10.1155/2017/6790975.

Abstract:
Support vector machine (SVM) is one of the top picks in pattern recognition and classification related tasks. It has been used successfully to classify linearly separable and nonlinearly separable data with high accuracy. However, in terms of classification speed, SVMs are outperformed by many machine learning algorithms, especially when massive datasets are involved. SVM classification time scales linearly with the number of support vectors, and support vectors increase with increases in dataset size. Hence, SVM classification time can be enormously reduced if the SVM is trained on a reduced dataset. Instance selection techniques are among the most effective techniques for minimizing SVM training time. In this study, two instance selection techniques suitable for identifying relevant training instances are proposed. The techniques are evaluated on a dataset containing 4000 emails and the results obtained are compared to other existing techniques. The results reveal excellent improvement in SVM classification speed.
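
One common family of instance selection heuristics keeps only points near the class boundary, since interior points rarely become support vectors. The sketch below is such a heuristic under assumed details; it is not either of the paper's two proposed techniques.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def boundary_instances(X, y, k=10):
    """Keep only instances whose k-neighborhood contains more than one
    class; points deep inside a single-class region are discarded."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neigh_labels = y[idx[:, 1:]]              # drop each point itself
    mixed = (neigh_labels != y[:, None]).any(axis=1)
    return X[mixed], y[mixed]

# usage sketch: train the SVM on the reduced set
# Xr, yr = boundary_instances(X, y)
# clf = SVC().fit(Xr, yr)
```
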
17

Saxena, Amit, Shreya Pare, Mahendra Singh Meena, Deepak Gupta, Akshansh Gupta, Imran Razzak, Chin-Teng Lin, and Mukesh Prasad. "A Two-Phase Approach for Semi-Supervised Feature Selection." Algorithms 13, no. 9 (August 31, 2020): 215. http://dx.doi.org/10.3390/a13090215.

Abstract:
This paper proposes a novel approach for selecting a subset of features in semi-supervised datasets where only some of the patterns are labeled. The whole process is completed in two phases. In the first phase, i.e., Phase-I, the whole dataset is divided into two parts: the first part, which contains labeled patterns, and the second part, which contains unlabeled patterns. In the first part, a small number of features are identified using well-known maximum relevance (from the first part) and minimum redundancy (whole dataset) based feature selection approaches using the correlation coefficient. The subset of the identified features which produces a high classification accuracy using any supervised classifier on the labeled patterns is selected for later processing. In the second phase, i.e., Phase-II, the patterns belonging to the first and second parts are clustered separately into the available number of classes of the dataset. In the clusters of the first part, the majority class among the patterns belonging to a cluster, which is already given, is taken as the class for that cluster. Pairs are then formed between the cluster centroids of the first and second parts: the centroid of the second part nearest to a centroid of the first part is paired with it. As the class of the first centroid is known, the same class can be assigned to the paired centroid of the second part, which is unknown. If the actual classes of the patterns in the second part of the dataset are known, they can be used to test the classification accuracy of patterns in the second part. The proposed two-phase approach performs well in terms of classification accuracy and the number of features selected on the given benchmark datasets.
18

Dias, Lucas V., Péricles B. C. Miranda, André C. A. Nascimento, Filipe R. Cordeiro, Rafael Ferreira Mello, and Ricardo B. C. Prudêncio. "ImageDataset2Vec: An image dataset embedding for algorithm selection." Expert Systems with Applications 180 (October 2021): 115053. http://dx.doi.org/10.1016/j.eswa.2021.115053.

19

Chakraborty, Tanujit. "Imbalanced Ensemble Classifier for Learning from Imbalanced Business School Dataset." International Journal of Mathematical, Engineering and Management Sciences 4, no. 4 (August 1, 2019): 861–69. http://dx.doi.org/10.33889/ijmems.2019.4.4-068.

Abstract:
Private business schools in India face a regular problem of picking quality students for their MBA programs to achieve the desired placement percentage. Generally, such datasets are biased towards one class, i.e., imbalanced in nature, and learning from an imbalanced dataset is a difficult proposition. This paper proposes an imbalanced ensemble classifier which can handle the imbalanced nature of the dataset and achieve higher accuracy for the feature selection (selection of important characteristics of students) cum classification problem (prediction of placements based on the students' characteristics) for an Indian business school dataset. The optimal value of an important model parameter is found. Experimental evidence is also provided, using the Indian business school dataset, of the outstanding performance of the proposed imbalanced ensemble classifier.
20

Naz, Mehreen, Kashif Zafar, and Ayesha Khan. "Ensemble Based Classification of Sentiments Using Forest Optimization Algorithm." Data 4, no. 2 (May 23, 2019): 76. http://dx.doi.org/10.3390/data4020076.

Abstract:
Feature subset selection is a process to choose a set of relevant features from a high-dimensionality dataset to improve the performance of classifiers. The meaningful words extracted from data form a set of features for sentiment analysis. Many evolutionary algorithms, like the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), have been applied to the feature subset selection problem, and computational performance can still be improved. This research presents a solution to the feature subset selection problem for the classification of sentiments using ensemble-based classifiers. It consists of a hybrid technique of minimum redundancy and maximum relevance (mRMR) and Forest Optimization Algorithm (FOA)-based feature selection. Ensemble-based classification is implemented to optimize the results of individual classifiers. The Forest Optimization Algorithm as a feature selection technique has been applied to various classification datasets from the UCI machine learning repository. The classifiers used in the ensemble methods for the UCI repository datasets are k-Nearest Neighbor (k-NN) and Naïve Bayes (NB). For the classification of sentiments, a 15–20% improvement has been recorded. The dataset used for the classification of sentiments is Blitzer's dataset, consisting of reviews of electronic products. The results are further improved by an ensemble of k-NN, NB, and Support Vector Machine (SVM), with an accuracy of 95% for the sentiment classification task.
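
The mRMR half of the hybrid can be sketched with a greedy loop over mutual information scores; the FOA wrapper stage is omitted, and using `mutual_info_regression` for feature-feature redundancy is an assumption of this sketch.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, n_select):
    """Greedy mRMR: at each step add the feature maximizing relevance
    (MI with the label) minus mean redundancy (MI with the features
    already selected)."""
    X = np.asarray(X, dtype=float)
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for j in set(range(X.shape[1])) - set(selected):
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```
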
21

Yao, Shu-Nung, Tim Collins, and Chaoyun Liang. "Head-Related Transfer Function Selection Using Neural Networks." Archives of Acoustics 42, no. 3 (September 26, 2017): 365–73. http://dx.doi.org/10.1515/aoa-2017-0038.

Abstract:
In binaural audio systems, for an optimal virtual acoustic space, a set of head-related transfer functions (HRTFs) should be used that closely matches the listener's own. This study aims to select the most appropriate HRTF dataset from a large database for users without the need for extensive listening tests. Currently, there is no way to reliably reduce the number of datasets to a smaller, more manageable number without risking discarding potentially good matches. A neural network that estimates the appropriateness of HRTF datasets based on input vectors of anthropometric measurements is proposed. The shapes and sizes of listeners' heads and pinnas were measured using digital photography; the measured anthropometric parameters form the feature vectors used by the neural network. A graphical user interface (GUI) was developed for participants to listen to music transformed using different HRTFs and to evaluate the fitness of each HRTF dataset. The listening scores recorded were the target outputs used to train the neural networks. The aim was to learn a mapping between anthropometric parameters and listeners' perception scores. Experimental validations were performed on 30 subjects. It is demonstrated that the proposed system produces a much more reliable HRTF selection than previously used methods.
22

Yu, Hui, Kang Tu, Lu Xie, and Yuan-Yuan Li. "DigOut: Viewing Differential Expression Genes as Outliers." Journal of Bioinformatics and Computational Biology 08, supp01 (December 2010): 161–75. http://dx.doi.org/10.1142/s0219720010005208.

Abstract:
With regard to well-replicated two-conditional microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-conditional microarray datasets with limited or no replication, the same task was not properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-conditional microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-conditional expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm like DigOut is particularly useful for selecting DE genes from non-replicated multi-conditional gene expression datasets.
23

Bolón-Canedo, V., N. Sánchez-Maroño, and A. Alonso-Betanzos. "Feature selection and classification in multiple class datasets: An application to KDD Cup 99 dataset." Expert Systems with Applications 38, no. 5 (May 2011): 5947–57. http://dx.doi.org/10.1016/j.eswa.2010.11.028.

24

Liu, Rong, Robert Rallo, and Yoram Cohen. "Unsupervised Feature Selection Using Incremental Least Squares." International Journal of Information Technology & Decision Making 10, no. 06 (November 2011): 967–87. http://dx.doi.org/10.1142/s0219622011004671.

Abstract:
An unsupervised feature selection method is proposed for analysis of datasets of high dimensionality. The least square error (LSE) of approximating the complete dataset via a reduced feature subset is proposed as the quality measure for feature selection. Guided by the minimization of the LSE, a kernel least squares forward selection algorithm (KLS-FS) is developed that is capable of both linear and non-linear feature selection. An incremental LSE computation is designed to accelerate the selection process and, therefore, enhances the scalability of KLS-FS to high-dimensional datasets. The superiority of the proposed feature selection algorithm, in terms of keeping principal data structures, learning performances in classification and clustering applications, and robustness, is demonstrated using various real-life datasets of different sizes and dimensions.
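
The LSE criterion has a compact linear-algebra form: project the full data matrix onto the span of the selected columns and measure the residual. The sketch below is a plain, non-kernel, non-incremental version; the kernel trick and incremental update that make KLS-FS scalable are not reproduced here.

```python
import numpy as np

def lse(X, subset):
    """Least-square error of reconstructing all of X from the columns
    in `subset` via the linear map X_S @ pinv(X_S) @ X."""
    Xs = X[:, subset]
    proj = Xs @ np.linalg.pinv(Xs) @ X        # projection onto span(X_S)
    return np.sum((X - proj) ** 2)

def forward_lse_selection(X, n_select):
    """Plain forward selection on the LSE criterion."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best = min(remaining, key=lambda j: lse(X, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```
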
25

Murugesan, S., R. S. Bhuvaneswaran, H. Khanna Nehemiah, S. Keerthana Sankari, and Y. Nancy Jane. "Feature Selection and Classification of Clinical Datasets Using Bioinspired Algorithms and Super Learner." Computational and Mathematical Methods in Medicine 2021 (May 17, 2021): 1–18. http://dx.doi.org/10.1155/2021/6662420.

Abstract:
A computer-aided diagnosis (CAD) system that employs a super learner to diagnose the presence or absence of a disease has been developed. Each clinical dataset is preprocessed and split into a training set (60%) and a testing set (40%). A wrapper approach that uses three bioinspired algorithms, namely cat swarm optimization (CSO), krill herd (KH), and bacterial foraging optimization (BFO), with the classification accuracy of a support vector machine (SVM) as the fitness function, has been used for feature selection. The selected features of each bioinspired algorithm are stored in three separate databases. The features selected by each bioinspired algorithm are used to train three back propagation neural networks (BPNN) independently using the conjugate gradient algorithm (CGA). Classifier testing is performed by using the testing set on each trained classifier, and the diagnostic results obtained are used to evaluate the performance of each classifier. The classification results obtained for each instance of the testing set of the three classifiers, together with the class label associated with each instance of the testing set, are the candidate instances for training and testing the super learner. The training set comprises 80% of the instances, and the testing set comprises 20% of the instances. Experimentation has been carried out using seven clinical datasets from the University of California Irvine (UCI) machine learning repository. The super learner has achieved a classification accuracy of 96.83% for the Wisconsin diagnostic breast cancer dataset (WDBC), 86.36% for the Statlog heart disease dataset (SHD), 94.74% for the hepatocellular carcinoma dataset (HCC), 90.48% for the hepatitis dataset (HD), 81.82% for the vertebral column dataset (VCD), 84% for the Cleveland heart disease dataset (CHD), and 70% for the Indian liver patient dataset (ILP).
26

Garg, Siddhant, Thuy Vu, and Alessandro Moschitti. "TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 7780–88. http://dx.doi.org/10.1609/aaai.v34i05.6282.

Abstract:
We propose TandA, an effective technique for fine-tuning pre-trained Transformer models for natural language tasks. Specifically, we first transfer a pre-trained model into a model for a general task by fine-tuning it with a large and high-quality dataset. We then perform a second fine-tuning step to adapt the transferred model to the target domain. We demonstrate the benefits of our approach for answer sentence selection, which is a well-known inference task in Question Answering. We built a large-scale dataset to enable the transfer step, exploiting the Natural Questions dataset. Our approach establishes the state of the art on two well-known benchmarks, WikiQA and TREC-QA, achieving the impressive MAP scores of 92% and 94.3%, respectively, which largely outperform the highest previous scores of 83.4% and 87.5%. We empirically show that TandA generates more stable and robust models, reducing the effort required for selecting optimal hyper-parameters. Additionally, we show that the transfer step of TandA makes the adaptation step more robust to noise. This enables a more effective use of noisy datasets for fine-tuning. Finally, we also confirm the positive impact of TandA in an industrial setting, using domain-specific datasets subject to different types of noise.
27

Alshamlan, Hala M., Ghada H. Badr, and Yousef A. Alohali. "The Performance of Bio-Inspired Evolutionary Gene Selection Methods for Cancer Classification Using Microarray Dataset." International Journal of Bioscience, Biochemistry and Bioinformatics 4, no. 3 (2014): 166–70. http://dx.doi.org/10.7763/ijbbb.2014.v4.332.

28

Morkonda Gunasekaran, Dinesh, and Prabha Dhandayudam. "Design of novel multi filter union feature selection framework for breast cancer dataset." Concurrent Engineering 29, no. 3 (May 31, 2021): 285–90. http://dx.doi.org/10.1177/1063293x211016046.

Abstract:
Nowadays, women are commonly diagnosed with breast cancer, and feature selection plays an important role when constructing a classification-based framework. We have proposed a Multi filter union (MFU) feature selection method for a breast cancer dataset. The feature selection process uses a union model based on the random forest algorithm and the Logistic regression (LG) algorithm to select important features in the dataset. The performance of the data analysis is evaluated using the optimal feature subset from the selected dataset. The experiments are computed on the dataset of the Wisconsin diagnostic breast cancer center and then on a real dataset from a women's health care center. The results of the proposed approach show high performance and efficiency when compared with existing feature selection algorithms.
29

Shayegan, Mohammad Amin, Saeed Aghabozorgi, and Ram Gopal Raj. "A Novel Two-Stage Spectrum-Based Approach for Dimensionality Reduction: A Case Study on the Recognition of Handwritten Numerals." Journal of Applied Mathematics 2014 (2014): 1–14. http://dx.doi.org/10.1155/2014/654787.

Abstract:
Dimensionality reduction (feature selection) is an important step in pattern recognition systems. Although there are different conventional approaches for feature selection, such as Principal Component Analysis, Random Projection, and Linear Discriminant Analysis, selecting optimal, effective, and robust features is usually a difficult task. In this paper, a new two-stage approach for dimensionality reduction is proposed. This method is based on one-dimensional and two-dimensional spectrum diagrams of standard deviation and minimum-to-maximum distributions for initial feature vector elements. The proposed algorithm is validated in an OCR application, using two large standard benchmark handwritten OCR datasets, MNIST and Hoda. In the beginning, a 133-element feature vector was selected from the most used features proposed in the literature. Finally, the size of the initial feature vector was reduced from 100% to 59.40% (79 elements) for the MNIST dataset and to 43.61% (58 elements) for the Hoda dataset, respectively. Meanwhile, the accuracies of the OCR systems are enhanced by 2.95% for the MNIST dataset and 4.71% for the Hoda dataset. The achieved results show an improvement in the precision of the system in comparison to the rival approaches, Principal Component Analysis and Random Projection. The proposed technique can also be useful for generating decision rules in a pattern recognition system using rule-based classifiers.
30

Paul, Dipanjyoti, Rahul Kumar, Sriparna Saha, and Jimson Mathew. "Multi-objective Cuckoo Search-based Streaming Feature Selection for Multi-label Dataset." ACM Transactions on Knowledge Discovery from Data 15, no. 6 (May 19, 2021): 1–24. http://dx.doi.org/10.1145/3447586.

Abstract:
The feature selection method is the process of selecting only relevant features by removing irrelevant or redundant features amongst the large number of features that are used to represent data. Nowadays, many application domains, especially social media networks, generate new features continuously at different time stamps. In such a scenario, when the features are arriving in an online fashion, to cope with the continuous arrival of features, the selection task must also be a continuous process. Therefore, a streaming feature selection based approach has to be incorporated, i.e., every time a new feature or a group of features arrives, the feature selection process has to be invoked. Again, in recent years, there are many application domains that generate data where samples may belong to more than one class, called multi-label datasets. The multiple labels that the instances are associated with may have some dependencies amongst themselves. Finding the correlation amongst the class labels helps to select the discriminative features across multiple labels. In this article, we develop streaming feature selection methods for multi-label data where the multiple labels are reduced to a lower-dimensional space. Similar labels are grouped together before performing the selection method to improve the selection quality and to make the model time efficient. A multi-objective version of the cuckoo search-based approach is used to select the optimal feature set. The proposed method develops two versions of the streaming feature selection method: (1) when the features arrive individually and (2) when the features arrive in the form of a batch. Various multi-label datasets from various domains, such as text, biology, and audio, have been used to test the developed streaming feature selection methods. The proposed methods are compared with many previous feature selection methods, and from the comparison, the superiority of using multiple objectives and label correlation in the feature selection process can be established.
31

Leng, Mingwei, Jianjun Cheng, Jinjin Wang, Zhengquan Zhang, Hanhai Zhou, and Xiaoyun Chen. "Active Semisupervised Clustering Algorithm with Label Propagation for Imbalanced and Multidensity Datasets." Mathematical Problems in Engineering 2013 (2013): 1–10. http://dx.doi.org/10.1155/2013/641927.

Abstract:
The accuracy of most existing semisupervised clustering algorithms based on a small amount of labeled data is low when dealing with multidensity and imbalanced datasets, and labeling data is quite expensive and time consuming in many real-world applications. This paper focuses on active data selection and semisupervised clustering in multidensity and imbalanced datasets and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active mechanism for data selection to minimize the amount of labeled data, and it utilizes multiple thresholds to expand labeled datasets on multidensity and imbalanced datasets. Three standard datasets and one synthetic dataset are used to demonstrate the proposed algorithm, and the experimental results show that the proposed semisupervised clustering algorithm has higher accuracy and a more stable performance in comparison to other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.
32

Kabir Ahmad, Farzana, Yuhanis Yusof, and Nooraini Yusoff. "Filter-Based Gene Selection Method for Tissues Classification on Large Scale Gene Expression Data." International Journal of Engineering & Technology 7, no. 2.15 (April 6, 2018): 68. http://dx.doi.org/10.14419/ijet.v7i2.15.11216.

Abstract:
DNA microarray technology is a recent innovative tool that offers a new perspective on cellular systems and measures a large scale of gene expressions at once. Regardless of the novel invention of the DNA microarray, most of its results rely on computational intelligence, which is used to interpret the large amount of data. At present, interpreting large-scale gene expression data remains a thought-provoking issue due to its innate nature of "high dimensional, low sample size". Microarray data mainly involve thousands of genes, n, in a very small sample size, p. In addition, the analysis of these data is often overwhelmed and confounded by over-fitting and the complexity of data analysis. Due to the nature of microarray data, it is also common that a large number of genes may not be informative for classification purposes. For this reason, many studies have used feature selection methods to select significant genes that present the maximum discriminative power between cancerous and normal tissues. In this study, we aim to investigate and compare the effectiveness of four popular filter gene selection methods, namely Signal-to-Noise Ratio (SNR), Fisher Criterion (FC), Information Gain (IG), and the t-Test, in selecting informative genes that can distinguish cancer and normal tissues. Two common classifiers, Support Vector Machine (SVM) and Decision Tree (C4.5), are used to train on the selected genes. These gene selection methods are tested on three large-scale gene expression datasets, namely a breast cancer dataset, a colon dataset, and a lung dataset. This study discovered that IG and SNR are more suitable to be used with SVM, while IG fits C4.5. On the colon dataset, SVM achieved a specificity of 86% with SNR and 80% with IG. In contrast, C4.5 obtained a specificity of 78% with IG on the identical dataset. These results indicate that SVM performed slightly better with IG pre-processed data compared to C4.5 on the same dataset.
33

Tsukahara, Yoko, Terry A. Gipson, Ryszard Puchala, and Arthur L. Goetsch. "Selection Methods for Models to Predict Feedstuff Associative Effects in Goats." Journal of Animal Science 99, Supplement_2 (May 1, 2021): 39–40. http://dx.doi.org/10.1093/jas/skab096.072.

Abstract:
Animal nutrition models can be useful to gain an understanding of the factors responsible for, and to predict, biological responses. However, the specific models developed can vary among the statistical methods employed, especially with relatively small datasets. In this meta-analysis study, independent variables selected by regression tree, stepwise regression, and Least Absolute Shrinkage and Selection Operator (LASSO) analyses were compared. The database consisted of 135 treatment means (weighted by the number of observations) from 25 publications in which goats consumed forage ad libitum with or without supplementation, and was divided into three subsets of forage with a crude protein (CP) concentration < 6% (Low; n = 46), 6–10% (Moderate; n = 50), and > 10% (High; n = 39). Regression tree analysis was conducted with rpart of the R statistical programming language, and stepwise regression (proc stepwise) and LASSO (proc glmselect) were conducted with SAS. The target variable was forage metabolizable energy (ME) intake relative to metabolic body weight (BW0.75), and potential predictor variables were supplement ME intake also scaled to BW0.75, forage organic matter (OM) digestibility and neutral detergent fiber (NDF) concentration, and supplement concentrations of ME and CP. As shown in Table 1, supplement ME intake was selected with each method. Based on the order of selection, forage NDF concentration had a larger impact than supplement ME intake with the Low dataset, whereas supplement ME intake was most important with the Moderate and High datasets. Among the statistical methods evaluated, the selected variables varied most with the Low dataset, were similar for the Moderate dataset, and were the same for the stepwise and LASSO approaches with the High dataset. In conclusion, model development to predict feedstuff associative effects in goats requires careful attention to variable selection, which is impacted by the statistical method and varies with forage composition.
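
As a small illustration of one of the three compared methods, LASSO-based variable selection can be reproduced in a few lines with scikit-learn rather than SAS's proc glmselect; the data below are synthetic stand-ins, not the goat nutrition database.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(135, 5))                 # 135 "treatment means", 5 predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=135)

lasso = LassoCV(cv=5).fit(X, y)               # penalty chosen by cross-validation
picked = np.flatnonzero(lasso.coef_ != 0)     # variables LASSO retains
print("variables retained by LASSO:", picked)
```
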
34

Chin, Fung Yuen, and Yong Kheng Goh. "The new baseline for high dimensional dataset by ranked mutual information features." ITM Web of Conferences 36 (2021): 01014. http://dx.doi.org/10.1051/itmconf/20213601014.

Abstract:
Feature selection is a process of selecting a group of relevant features by removing unnecessary features for use in constructing the predictive model. However, high-dimensional data increase the difficulty of feature selection due to the curse of dimensionality. In past research, the performance of the predictive model has always been compared with existing results. When attempting to model a new dataset, the current practice is to benchmark against the dataset obtained by including all the features, including redundant features and noise. Here we propose a new optimal baseline for the dataset by means of features ranked using a mutual information score. The quality of a dataset depends on the information contained in the dataset, and the more information the dataset contains, the better the performance of the predictive model. The number of features needed to achieve this new optimal baseline is obtained at the same time, and serves as a guideline on the number of features needed in a feature selection method. We also show through experimental results that the proposed method provides a better baseline with fewer features compared to the existing benchmark using all the features.
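
The ranked-mutual-information baseline can be sketched directly with scikit-learn; the breast cancer dataset and SVM classifier below are stand-ins for whatever dataset and model are being benchmarked.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# rank features from most to least informative by mutual information
order = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

# accuracy as a function of how many top-MI features are kept
for k in (5, 10, 20, X.shape[1]):
    acc = cross_val_score(SVC(), X[:, order[:k]], y, cv=5).mean()
    print(f"top {k:2d} features: {acc:.3f}")
```

The smallest k whose score matches the all-features score would serve as the proposed baseline feature count.
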
35

Jabor, Ali Hakem, and Ali Hussein Ali. "Dual Heuristic Feature Selection Based on Genetic Algorithm and Binary Particle Swarm Optimization." JOURNAL OF UNIVERSITY OF BABYLON for Pure and Applied Sciences 27, no. 1 (April 1, 2019): 171–83. http://dx.doi.org/10.29196/jubpas.v27i1.2106.

Abstract:
Feature selection is one of the data mining tools used to select the most important features of a given dataset. It saves time and memory when handling the dataset. Following these principles, we have proposed a feature selection method based on mixing two metaheuristic algorithms, Binary Particle Swarm Optimization and Genetic Algorithm, working individually. The K-Nearest Neighbour (K-NN) classifier is used as an objective function to evaluate the proposed feature selection algorithm. The Dual Heuristic Feature Selection based on Genetic Algorithm and Binary Particle Swarm Optimization (DHFS) is tested and compared on 26 well-known datasets from the UCI machine learning repository. The numerical experimental results imply that DHFS performs better compared with the full feature sets and with those selected by the mentioned algorithms (Genetic Algorithm and Binary Particle Swarm Optimization).
36

Thepade, Sudeep, Rik Das, and Saurav Ghosh. "Feature Extraction with Ordered Mean Values for Content Based Image Classification." Advances in Computer Engineering 2014 (December 17, 2014): 1–15. http://dx.doi.org/10.1155/2014/454876.

Abstract:
Categorization of images into meaningful classes by efficient extraction of feature vectors from image datasets has been dependent on feature selection techniques. Traditionally, feature vector extraction has been carried out using different methods of image binarization done with selection of a global, local, or mean threshold. This paper proposes a novel technique for feature extraction based on ordered mean values. The proposed technique was combined with feature extraction using the discrete sine transform (DST) for better classification results using multi-technique fusion. The novel methodology was compared to the traditional techniques used for feature extraction for content based image classification. Three benchmark datasets, namely the Wang dataset, the Oliva and Torralba (OT-Scene) dataset, and the Caltech dataset, were used for evaluation. The performance evaluation clearly revealed the superiority of the proposed fusion technique with ordered mean values and the discrete sine transform over the popular single-view feature extraction approaches for classification.
37

Sampathkumar, A., and P. Vivekanandan. "Gene Selection Using Parallel Lion Optimization Method in Microarray Data for Cancer Classification." Journal of Medical Imaging and Health Informatics 9, no. 6 (August 1, 2019): 1294–300. http://dx.doi.org/10.1166/jmihi.2019.2723.

Abstract:
In the field of bioinformatics research, a large volume of genetic data has been generated. The availability of higher-throughput devices at lower cost has contributed to this generation of huge volumes of data, and handling such numerous data has made selecting the relevant disease-causing gene extremely challenging. The development of microarray technology provides higher chances of cancer diagnosis by enabling the measurement of the expression levels of multiple genes at the same time. Selecting the relevant genes by using classifiers for the investigation of gene expression data is a complicated process. Proper identification of genes from gene expression datasets plays a vital role in improving the accuracy of classification. In this article, identification of the most relevant genes from gene expression data for cancer treatment is discussed in detail, using a modified meta-heuristic approach known as "parallel lion optimization" (PLOA) for selecting genes from microarray data that can classify various cancer sub-types with more accuracy. The experimental results show that PLOA outperforms LOA and other well-known approaches on the five benchmark cancer gene expression datasets considered. It returns 99% classification accuracy for the Prostate, Lung, Leukemia, and Central Nervous System (CNS) datasets for the top 200 genes, and for the Prostate and Lymphoma datasets PLOA achieves 99.19% and 99.93%, respectively. Evaluating the results against other algorithms, the proposed algorithm achieves the higher level of accuracy in gene selection.
38

Marx, Edgard, Tommaso Soru, Saeedeh Shekarpour, Sören Auer, Axel-Cyrille Ngonga Ngomo, and Karin Breitman. "Towards an Efficient RDF Dataset Slicing." International Journal of Semantic Computing 07, no. 04 (December 2013): 455–77. http://dx.doi.org/10.1142/s1793351x13400151.

Abstract:
In recent years, a considerable amount of structured data has been published on the Web as Linked Open Data (LOD). Despite recent advances, consuming and using Linked Open Data within an organization is still a substantial challenge. Many LOD datasets are quite large, and despite progress in Resource Description Framework (RDF) data management, loading and querying them within a triple store is extremely time-consuming and resource-demanding. To overcome this consumption obstacle, we propose a process inspired by the classical Extract-Transform-Load (ETL) paradigm. In this article, we focus particularly on the selection and extraction steps of this process. We devise a fragment of the SPARQL Protocol and RDF Query Language (SPARQL), dubbed SliceSPARQL, which enables the selection of well-defined slices of datasets fulfilling typical information needs. SliceSPARQL supports graph patterns for which each connected subgraph pattern involves a maximum of one variable or Internationalized Resource Identifier (IRI) in its join conditions. This restriction guarantees efficient processing of the query against a sequential dataset dump stream. Furthermore, we evaluate our slicing approach using three different optimization strategies. Results show that dataset slices can be generated an order of magnitude faster than by the conventional approach of loading the whole dataset into a triple store.
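SliceSPARQL itself is a SPARQL fragment, but the key engineering idea, matching simple patterns against a sequential dump stream instead of loading a triple store, can be illustrated with a single pass over an N-Triples file; the regex-level parsing and the fixed-predicate pattern below are simplifying assumptions, and the file name and IRI are hypothetical.

```python
import re

# Very rough N-Triples line pattern: <s> <p> <o> .  (literals not handled)
TRIPLE = re.compile(r'^(<[^>]+>)\s+(<[^>]+>)\s+(.+)\s+\.\s*$')

def slice_dump(path, predicate_iri):
    """Stream a dataset dump and keep triples matching one fixed predicate,
    mimicking a single-variable slice pattern such as  ?s <p> ?o ."""
    with open(path, encoding="utf-8") as dump:
        for line in dump:
            m = TRIPLE.match(line)
            if m and m.group(2) == predicate_iri:
                yield m.groups()

# Usage (hypothetical file and IRI):
# for s, p, o in slice_dump("dataset.nt", "<http://xmlns.com/foaf/0.1/name>"):
#     print(s, o)
```

The one-pass design is what makes the approach an order of magnitude faster than a triple-store load: memory stays constant and no index needs to be built.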
APA, Harvard, Vancouver, ISO, and other styles
39

Guha, Ritam, Manosij Ghosh, Pawan Kumar Singh, Ram Sarkar, and Mita Nasipuri. "M-HMOGA: A New Multi-Objective Feature Selection Algorithm for Handwritten Numeral Classification." Journal of Intelligent Systems 29, no. 1 (June 14, 2019): 1453–67. http://dx.doi.org/10.1515/jisys-2019-0064.

Full text
Abstract:
The feature selection process is very important in the field of pattern recognition: it selects the informative features so as to reduce the curse of dimensionality, thus improving the overall classification accuracy. In this paper, a new feature selection approach named Memory-Based Histogram-Oriented Multi-objective Genetic Algorithm (M-HMOGA) is introduced to identify the informative feature subset to be used for a pattern classification problem. The proposed M-HMOGA approach is applied to two recently used feature sets, namely Mojette transform and Regional Weighted Run Length features. The experiments are carried out on Bangla, Devanagari, and Roman numeral datasets, the three most popular scripts used in the Indian subcontinent. In-house Bangla and Devanagari script datasets and the Competition on Handwritten Digit Recognition (HDRC) 2013 Roman numeral dataset are used for evaluating our model. Moreover, as proof of robustness, we have applied an innovative approach of using different datasets for training and testing: the in-house Bangla and Devanagari script datasets are used for training the model, and the trained model is then tested on Indian Statistical Institute numeral datasets; for Roman numerals, the HDRC 2013 dataset is used for training and the Modified National Institute of Standards and Technology dataset for testing. Comparison of the results obtained by the proposed model with the existing HMOGA and MOGA techniques clearly indicates the superiority of M-HMOGA over both of its ancestors. Moreover, the use of K-nearest neighbor as well as multi-layer perceptron classifiers speaks for the classifier-independent nature of M-HMOGA. Relative to using all features, the proposed M-HMOGA model uses only about 45–50% of the total feature set to achieve around a 1% increase in classification ability when the same datasets are partitioned for training and testing, and uses only 35–45% of the features to achieve a 2–3% increase when different datasets are used for training and testing.
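The histogram memory that distinguishes M-HMOGA is not described in the abstract, so the sketch below shows only the generic genetic-algorithm backbone such a model optimizes, with the two objectives (high accuracy, small subset) scalarized into one fitness as a simplification; the weight 0.1, population size, and digits dataset are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(1)
n_feat = X.shape[1]

def fitness(mask):
    """Scalarized two-objective fitness: accuracy high, subset small."""
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return acc - 0.1 * mask.sum() / n_feat     # weight 0.1 is an assumption

pop = rng.random((20, n_feat)) < 0.5           # random initial population
for gen in range(10):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]    # simple truncation selection
    cut = rng.integers(1, n_feat, size=10)     # one-point crossover
    kids = np.array([np.r_[parents[i][:c], parents[-i - 1][c:]]
                     for i, c in enumerate(cut)])
    kids ^= rng.random(kids.shape) < 0.01      # bit-flip mutation
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print(f"{best.sum()} of {n_feat} features kept")
```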
APA, Harvard, Vancouver, ISO, and other styles
40

Liu, Yimo, Wanchang Zhang, Zhijie Zhang, Qiang Xu, and Weile Li. "Risk Factor Detection and Landslide Susceptibility Mapping Using Geo-Detector and Random Forest Models: The 2018 Hokkaido Eastern Iburi Earthquake." Remote Sensing 13, no. 6 (March 18, 2021): 1157. http://dx.doi.org/10.3390/rs13061157.

Full text
Abstract:
Landslide susceptibility mapping is an effective approach for landslide risk prevention and assessment. The occurrence of slope instability is highly correlated with intrinsic variables that contribute to the occurrence of landslides, such as geology, geomorphology, climate, and hydrology. However, how to select conditioning factors that constitute datasets with optimal predictive capability, effectively and accurately, is still an open question. The present study examines the integration of selected landslide conditioning factors with the Q-statistic in Geo-detector for determining the stratification and selection of landslide conditioning factors in landslide risk analysis, so as to ultimately optimize landslide susceptibility model prediction. The location chosen for the study was Atsuma Town, which suffered from landslides following the Eastern Iburi Earthquake in 2018 in Hokkaido, Japan. A total of 13 conditioning factors were obtained from different sources, belonging to six categories: geology, geomorphology, seismology, hydrology, land cover/use, and human activity; these were selected to generate the datasets for landslide susceptibility mapping. The original datasets of landslide conditioning factors were analyzed with the Q-statistic in Geo-detector to examine their explanatory powers regarding the occurrence of landslides. A Random Forest (RF) model was adopted for landslide susceptibility mapping. Subsequently, four subsets, including the Manually delineated landslide Points with 9 features Dataset (MPD9), the Randomly delineated landslide Points with 9 features Dataset (RPD9), the Manually delineated landslide Points with 13 features Dataset (MPD13), and the Randomly delineated landslide Points with 13 features Dataset (RPD13), were selected by Q-statistic analysis for training and validating the Geo-detector-RF-integrated model. Overall, the model yielded the highest prediction accuracy using dataset MPD9 (89.90%), followed by datasets MPD13 (89.53%), RPD13 (88.63%), and RPD9 (87.07%), which implies that optimized conditioning factors can effectively improve the prediction accuracy of landslide susceptibility mapping.
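The Q-statistic used to rank conditioning factors has a simple closed form, q = 1 - (sum over strata h of N_h * sigma_h^2) / (N * sigma^2), measuring how much of the variance of the outcome a stratified factor explains. A direct implementation might look like this; the data below are toy values, not the Iburi dataset.

```python
import numpy as np

def q_statistic(y, strata):
    """Geo-detector q-statistic: share of the variance of y
    (e.g. landslide density) explained by a stratified factor."""
    y, strata = np.asarray(y, float), np.asarray(strata)
    sst = len(y) * y.var()                     # total sum of squares
    ssw = sum(len(y[strata == h]) * y[strata == h].var()
              for h in np.unique(strata))      # within-strata sum of squares
    return 1.0 - ssw / sst

rng = np.random.default_rng(2)
slope_class = rng.integers(0, 5, size=1000)            # stratified factor
landslide = slope_class * 0.2 + rng.normal(0, 0.3, 1000)
print(f"q = {q_statistic(landslide, slope_class):.3f}")  # near 1 => strong factor
```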
APA, Harvard, Vancouver, ISO, and other styles
41

Wang, Hao, Suxing Lyu, and Yaxin Ren. "Paddy Rice Imagery Dataset for Panicle Segmentation." Agronomy 11, no. 8 (July 31, 2021): 1542. http://dx.doi.org/10.3390/agronomy11081542.

Full text
Abstract:
Accurate panicle identification is a key step in rice-field phenotyping. Deep learning methods based on high-spatial-resolution images provide a high-throughput and accurate solution for panicle segmentation. However, panicle segmentation tasks require costly annotations to train an accurate and robust deep learning model, and few public datasets are available for rice-panicle phenotyping. We present a semi-supervised deep learning model training process that greatly assists the annotation and refinement of training datasets. The model learns panicle features from limited annotations and localizes more positive samples in the datasets without further interaction. After the dataset refinement, the number of annotations increased by 40.6%. In addition, we trained and tested modern deep learning models to show how the dataset benefits both detection and segmentation tasks. The results of our comparison experiments can inform others in dataset preparation and model selection.
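The exact architecture is not named in the abstract; the skeleton below sketches the generic pseudo-labelling loop it describes (train on the limited annotations, localize additional positives, fold confident detections back into the training set). The `fit` and `predict` callables, the 0.9 confidence threshold, and the round count are placeholders, not the authors' settings.

```python
def refine_annotations(train_set, unlabeled, fit, predict, conf=0.9, rounds=3):
    """Generic semi-supervised refinement: grow the annotation set with
    high-confidence detections.  `fit` trains any detector on (image, box)
    pairs; `predict` yields (box, score) detections for one image."""
    for _ in range(rounds):
        model = fit(train_set)
        new = [(img, box) for img in unlabeled
               for box, score in predict(model, img) if score >= conf]
        if not new:
            break                      # no further positives localized
        train_set = train_set + new    # fold pseudo-labels into training data
    return train_set
```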
APA, Harvard, Vancouver, ISO, and other styles
42

Jothi, Neesha, Wahidah Husain, Nur’Aini Abdul Rashid, and Sharifah Mashita Syed-Mohamad. "Feature Selection Method using Genetic Algorithm for Medical Dataset." International Journal on Advanced Science, Engineering and Information Technology 9, no. 6 (December 24, 2019): 1907. http://dx.doi.org/10.18517/ijaseit.9.6.10226.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Burrell, Arden L., Jason P. Evans, and Yi Liu. "The impact of dataset selection on land degradation assessment." ISPRS Journal of Photogrammetry and Remote Sensing 146 (December 2018): 22–37. http://dx.doi.org/10.1016/j.isprsjprs.2018.08.017.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Lan, Zirui, Yisi Liu, Olga Sourina, Lipo Wang, Reinhold Scherer, and Gernot Müller-Putz. "SAFE: An EEG dataset for stable affective feature selection." Advanced Engineering Informatics 44 (April 2020): 101047. http://dx.doi.org/10.1016/j.aei.2020.101047.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Salfikar, Inzar, Indra Adji Sulistijono, and Achmad Basuki. "Automatic Samples Selection Using Histogram of Oriented Gradients (HOG) Feature Distance." EMITTER International Journal of Engineering Technology 5, no. 2 (January 13, 2018): 234–54. http://dx.doi.org/10.24003/emitter.v5i2.182.

Full text
Abstract:
Finding victims at a disaster site is the primary goal of Search-and-Rescue (SAR) operations. Many technologies for locating disaster victims through aerial imaging have emerged from research, but most of them struggle to detect victims at tsunami disaster sites, where victims and backgrounds look similar. This research collects post-tsunami aerial imagery from the internet to build a dataset and model for detecting tsunami disaster victims. The dataset is built from the distances between the features of each sample, computed with the Histogram of Oriented Gradients (HOG) method: samples are collected from photos by measuring the HOG feature distance between all samples, the sample with the longest distance is taken as a candidate for the dataset, and the samples are then classified manually into victims (positives) and non-victims (negatives). The resulting dataset of tsunami disaster victims was analyzed using Leave-One-Out (LOO) cross-validation with a Support Vector Machine (SVM). The experimental results on two test photos show 61.70% precision, 77.60% accuracy, 74.36% recall, and an F-measure of 67.44% in distinguishing victims (positives) from non-victims (negatives).
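One plausible reading of the sampling rule, keeping the candidate whose HOG descriptor is farthest from those already collected, can be sketched with scikit-image's `hog`; the greedy farthest-first loop is my interpretation of the abstract's wording, and the patch sizes are arbitrary.

```python
import numpy as np
from skimage.feature import hog

def farthest_first_samples(patches, k):
    """Greedily pick k patches whose HOG descriptors are farthest
    from the already-selected set (candidate training samples)."""
    feats = np.array([hog(p) for p in patches])
    chosen = [0]                                  # seed with the first patch
    while len(chosen) < k:
        d = np.min(                               # distance to nearest chosen
            [np.linalg.norm(feats - feats[c], axis=1) for c in chosen], axis=0)
        chosen.append(int(np.argmax(d)))          # farthest candidate next
    return chosen

rng = np.random.default_rng(3)
patches = rng.random((30, 64, 64))                # toy grayscale patches
print(farthest_first_samples(patches, 5))
```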
APA, Harvard, Vancouver, ISO, and other styles
46

Jeon, Hyelynn, and Sejong Oh. "Hybrid-Recursive Feature Elimination for Efficient Feature Selection." Applied Sciences 10, no. 9 (May 4, 2020): 3211. http://dx.doi.org/10.3390/app10093211.

Full text
Abstract:
As datasets continue to increase in size, it is important to select the optimal feature subset from the original dataset to obtain the best performance in machine learning tasks. Highly dimensional datasets with an excessive number of features can cause low performance in such tasks, with overfitting being a typical problem. In addition, datasets of high dimensionality can create storage shortages and require high computing power, and models fitted to such datasets can produce low classification accuracies. Thus, it is necessary to select a representative subset of features using an efficient selection method. Many feature selection methods have been proposed, including recursive feature elimination. In this paper, a hybrid-recursive feature elimination method is presented which combines the feature-importance-based recursive feature elimination methods of the support vector machine, random forest, and generalized boosted regression algorithms. Our experiments confirm that the performance of the proposed method is superior to that of the three single recursive feature elimination methods.
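The paper's exact combination rule is not given in the abstract; one plausible reading, averaging the RFE rankings produced by the three estimators and keeping the best-ranked features, can be sketched with scikit-learn (the dataset and the cut-off of ten features are assumptions):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

estimators = [
    SVC(kernel="linear"),               # importance from |coef_|
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
]
rankings = np.array([
    RFE(est, n_features_to_select=1).fit(X, y).ranking_ for est in estimators
])
mean_rank = rankings.mean(axis=0)       # hybrid: average the three rankings
top10 = np.argsort(mean_rank)[:10]      # keep the ten best-ranked features
print(top10)
```

Eliminating down to a single feature (`n_features_to_select=1`) forces each RFE run to produce a complete ranking, which is what makes the rankings comparable across estimators.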
APA, Harvard, Vancouver, ISO, and other styles
47

CHEN, PENG, CHUNMEI LIU, LEGAND BURGE, MOHAMMAD MAHMOOD, WILLIAM SOUTHERLAND, and CLAY GLOSTER. "PROTEIN FOLD CLASSIFICATION WITH GENETIC ALGORITHMS AND FEATURE SELECTION." Journal of Bioinformatics and Computational Biology 07, no. 05 (October 2009): 773–88. http://dx.doi.org/10.1142/s0219720009004321.

Full text
Abstract:
Protein fold classification is a key step towards predicting protein tertiary structures. This paper proposes a novel approach based on genetic algorithms and feature selection for classifying protein folds. Our dataset is divided into a training dataset and a test dataset. Each individual in the genetic algorithm represents a selection function over the feature vectors of the training dataset. A support vector machine is applied to each individual to evaluate its fitness value (the fold classification rate). The aim of the genetic algorithm is to search for the best individual, the one that produces the highest fold classification rate. The best individual is then applied to the feature vectors of the test dataset, and a support vector machine is built to classify protein folds based on the selected features. Our experimental results on Ding and Dubchak's benchmark dataset of 27 fold classes show that our approach achieves an accuracy of 71.28%, which outperforms current state-of-the-art protein fold predictors.
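The chromosome-as-selection-function idea can be made concrete with a small fitness routine: an individual is a boolean mask over the feature vector, and its fitness is the cross-validated SVM fold classification rate on the training split. The data shapes and labels below are synthetic stand-ins, not the Ding and Dubchak benchmark.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(540, 125))                 # toy fold feature vectors
y = np.repeat(np.arange(27), 20)                # 27 fold classes, toy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def fold_fitness(individual):
    """Fitness of one GA individual: fold classification rate of an SVM
    trained on the features the individual selects (boolean mask)."""
    if individual.sum() == 0:
        return 0.0
    return cross_val_score(SVC(), X_tr[:, individual], y_tr, cv=3).mean()

individual = rng.random(X.shape[1]) < 0.5       # one random chromosome
print(f"fitness = {fold_fitness(individual):.3f}")
```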
APA, Harvard, Vancouver, ISO, and other styles
48

Hussien, Hussien Rezk, El-Sayed M. El-Kenawy, and Ali I. El-Desouky. "EEG Channel Selection Using A Modified Grey Wolf Optimizer." European Journal of Electrical Engineering and Computer Science 5, no. 1 (January 12, 2021): 17–24. http://dx.doi.org/10.24018/ejece.2021.5.1.265.

Full text
Abstract:
Brain-Computer Interface (BCI), an increasingly active field of research, aims to form a direct channel of communication between a computer and the brain. However, extracting features from random, time-varying EEG signals and classifying them is a major challenge facing current BCIs. This paper proposes a modified grey wolf optimizer (MGWO) that selects optimal EEG channels for BCIs by identifying the main features of the dataset, discarding immaterial ones, and removing complexity; this helps the machine learning classifier during training on the dataset. MGWO, a metaheuristic swarm intelligence algorithm that imitates the leadership and hunting behaviour of grey wolves in nature, integrates two modifications to balance exploration and exploitation: the first applies an exponential change over the number of iterations to enlarge the search space and accordingly the exploitation; the second is a crossover operation that increases the diversity of the population and enhances exploitation capability. Experiments on four different EEG datasets (BCI Competition IV dataset 2a, BCI Competition IV dataset III, BCI Competition II dataset III, and the EEG Eye State dataset from the UCI Machine Learning Repository) evaluate the quality and effectiveness of MGWO, and a cross-validation method is used to measure its stability.
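The two modifications are described only loosely in the abstract. The sketch below shows the standard grey-wolf position update with an exponentially decaying control parameter standing in for the first modification; the crossover step is omitted, the channel count is assumed, and the toy objective replaces the real classifier-accuracy fitness.

```python
import numpy as np

rng = np.random.default_rng(5)
n_wolves, n_channels, iters = 10, 22, 50     # 22 EEG channels (assumed)

useful = np.zeros(n_channels, bool)
useful[[0, 3, 7]] = True                     # toy "informative" channels

def fitness(mask):
    """Toy objective: reward hitting informative channels, penalize extras
    (a real pipeline would score classifier accuracy on the channels)."""
    return (mask & useful).sum() - 0.1 * mask.sum()

X = rng.random((n_wolves, n_channels))       # continuous wolf positions
for t in range(iters):
    a = 2 * np.exp(-3 * t / iters)           # exponential decay (modification 1)
    masks = X > 0.5                          # binarize for channel selection
    order = np.argsort([fitness(m) for m in masks])[::-1]
    alpha, beta, delta = X[order[0]], X[order[1]], X[order[2]]
    X_new = np.zeros_like(X)
    for leader in (alpha, beta, delta):      # standard GWO position update
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        A, C = 2 * a * r1 - a, 2 * r2
        X_new += (leader - A * np.abs(C * leader - X)) / 3
    X = np.clip(X_new, 0, 1)

best = X[np.argmax([fitness(m) for m in (X > 0.5)])]
print("selected channels:", np.flatnonzero(best > 0.5))
```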
APA, Harvard, Vancouver, ISO, and other styles
49

Chen, Xi, and Afsaneh Doryab. "Optimizing the Feature Selection Process for Better Accuracy in Datasets with a Large Number of Features (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 10 (April 3, 2020): 13767–68. http://dx.doi.org/10.1609/aaai.v34i10.7155.

Full text
Abstract:
Most feature selection methods only perform well on datasets with a relatively small set of features. In the case of large feature sets and small numbers of data points, almost none of the existing feature selection methods help in achieving high accuracy. This paper proposes a novel approach that optimizes the feature selection process through the Frequent Pattern Growth algorithm, finding sets of features that appear frequently among the top features selected by the main feature selection methods. Our experimental evaluation on two datasets, containing a small and a very large number of features respectively, shows that our approach significantly improves the accuracy results on the dataset with a very large number of features.
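The core idea, counting which features recur among the top-k lists of several selectors, can be shown without a full FP-Growth implementation; here plain itemset counting stands in for the frequent-pattern mining step, and the feature names and support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical top features returned by three selection methods.
top_lists = [
    {"f1", "f2", "f5", "f9"},        # e.g. chi-square
    {"f1", "f2", "f7", "f9"},        # e.g. information gain
    {"f1", "f3", "f5", "f9"},        # e.g. random-forest importance
]

# Frequent single features and pairs (support >= 2 of 3 selectors).
counts = Counter()
for s in top_lists:
    for r in (1, 2):
        counts.update(combinations(sorted(s), r))
frequent = [set(k) for k, v in counts.items() if v >= 2]
print(frequent)                      # feature sets most selectors agree on
```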
APA, Harvard, Vancouver, ISO, and other styles
50

Rook, A. J., M. Ellis, C. T. Whittemore, and P. Phillips. "Relationships between whole-body chemical composition, physically dissected carcass parts and backfat measurements in pigs." Animal Science 44, no. 2 (April 1987): 263–73. http://dx.doi.org/10.1017/s0003356100018638.

Full text
Abstract:
Log-linear relationships between various measurements of the chemical and physical body composition of the pig were obtained in four datasets representing a range of sexes, genotypes and feeding treatments. One of these datasets (dataset 1) comprised genetic control and selection line Large White pigs. There were significant differences between datasets for most of the relationships investigated. The causes of the differences cannot be determined. Within datasets, relationships between various body components and the weight of crude protein in the whole body were unaffected by genotype or sex. The relationships of both intermuscular fat and trimmed carcass lipid with whole body lipid differed significantly between the control and selection lines in dataset 1. Fat thickness measurements taken over the m. longissimus at the last rib were less at the same body lipid in the selection line than the control line, suggesting a redistribution of fat away from this area as a result of selection. Relationships between viscera, lungs and empty body weight were significantly affected by line, while those between trimmed carcass, liver, kidneys and empty body weight were significantly affected by sex. Selection line pigs had less perinephric and retroperitoneal fat than controls at the same whole body fat weight and less subcutaneous fat at the same cold carcass weight. There were no significant line effects on lean or bone weight distribution. Selection line pigs had significantly less subcutaneous fat in the collar joint and more intermuscular fat in the ham. There were few significant sex effects on tissue weight distribution.
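The log-linear relationships fitted here are presumably of the standard allometric form ln y = a + b ln x; with numpy this is a one-line least-squares fit. The numbers below are illustrative only, not the pig data.

```python
import numpy as np

# Toy allometry: component weight y vs whole-body crude protein weight x.
x = np.array([8.0, 10.0, 12.5, 15.0, 18.0])   # kg, illustrative only
y = np.array([1.9, 2.5, 3.3, 4.1, 5.2])       # kg, illustrative only

b, a = np.polyfit(np.log(x), np.log(y), 1)    # ln y = a + b ln x
print(f"ln y = {a:.3f} + {b:.3f} ln x")       # b is the allometric coefficient
```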
APA, Harvard, Vancouver, ISO, and other styles