Journal articles on the topic 'Numeric and categorical data'

Consult the top 50 journal articles for your research on the topic 'Numeric and categorical data.'

1

Suguna, J., and M. Arul Selvi. "Ensemble Fuzzy Clustering for Mixed Numeric and Categorical Data." International Journal of Computer Applications 42, no. 3 (March 31, 2012): 19–23. http://dx.doi.org/10.5120/5672-7705.

2

Ji, Jinchao, Wei Pang, Zairong Li, Fei He, Guozhong Feng, and Xiaowei Zhao. "Clustering Mixed Numeric and Categorical Data With Cuckoo Search." IEEE Access 8 (2020): 30988–31003. http://dx.doi.org/10.1109/access.2020.2973216.

3

Wu, Chengyuan, and Carol Anne Hargreaves. "Topological Machine Learning for Mixed Numeric and Categorical Data." International Journal on Artificial Intelligence Tools 30, no. 05 (August 2021): 2150025. http://dx.doi.org/10.1142/s0218213021500251.

Abstract:
Topological data analysis is a relatively new branch of machine learning that excels in studying high-dimensional data, and is theoretically known to be robust against noise. Meanwhile, data objects with mixed numeric and categorical attributes are ubiquitous in real-world applications. However, topological methods are usually applied to point cloud data, and to the best of our knowledge there is no available framework for the classification of mixed data using topological methods. In this paper, we propose a novel topological machine learning method for mixed data classification. In the proposed method, we use theory from topological data analysis such as persistent homology, persistence diagrams and Wasserstein distance to study mixed data. The performance of the proposed method is demonstrated by experiments on a real-world heart disease dataset. Experimental results show that our topological method outperforms several state-of-the-art algorithms in the prediction of heart disease.
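
As a rough illustration of the pipeline sketched in this abstract, the hedged Python snippet below turns a small mixed-type table into a numeric point cloud and computes persistence diagrams. It assumes the third-party ripser and persim packages and made-up data; it is not the authors' actual implementation.

```python
# Hypothetical sketch: embed mixed-type records as a point cloud, then compute
# persistence diagrams. Assumes the `ripser` and `persim` packages are installed.
import numpy as np
import pandas as pd
from ripser import ripser
from persim import wasserstein

df = pd.DataFrame({
    "age":        [63, 41, 57, 39, 48],
    "chol":       [233, 180, 250, 199, 210],
    "chest_pain": ["typical", "atypical", "asymptomatic", "typical", "atypical"],
})

# Scale the numeric columns and one-hot encode the categorical one, so the mixed
# records become an ordinary numeric point cloud.
num = (df[["age", "chol"]] - df[["age", "chol"]].mean()) / df[["age", "chol"]].std()
X = pd.concat([num, pd.get_dummies(df["chest_pain"])], axis=1).to_numpy(dtype=float)

dgms = ripser(X, maxdim=1)["dgms"]        # persistence diagrams for H0 and H1
print("H0 points:", len(dgms[0]), "H1 points:", len(dgms[1]))

# Two diagrams can be compared with the Wasserstein distance (here a diagram with
# itself, after dropping the infinite H0 bar), e.g. as input to a downstream classifier.
finite_h0 = dgms[0][np.isfinite(dgms[0]).all(axis=1)]
print("self-distance:", wasserstein(finite_h0, finite_h0))
```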
4

Lee, Kyung Mi, and Keon Myung Lee. "A Locality Sensitive Hashing Technique for Categorical Data." Applied Mechanics and Materials 241-244 (December 2012): 3159–64. http://dx.doi.org/10.4028/www.scientific.net/amm.241-244.3159.

Abstract:
The measured data may contain various types of attributes, such as continuous, categorical, and set-valued attributes. Several locality-sensitive hashing techniques, which make it possible to find similar pairs of data in a fast, approximate way, have been developed for data with either numeric or set-valued attributes. This paper introduces a new locality-sensitive hashing technique applicable to data with categorical attributes.
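
The following minimal Python sketch illustrates the general flavour of locality-sensitive hashing for categorical records: each hash table keys on a random subset of attributes, so records agreeing on those attributes collide. It is an illustration only, not the specific technique proposed in the paper.

```python
# Minimal, generic LSH sketch for categorical records: each hash function keys on a
# random subset of attributes, so records agreeing on those attributes collide.
import random
from collections import defaultdict

def build_tables(records, num_tables=6, attrs_per_table=2, seed=0):
    rng = random.Random(seed)
    n_attrs = len(records[0])
    tables = []
    for _ in range(num_tables):
        idx = tuple(sorted(rng.sample(range(n_attrs), attrs_per_table)))
        buckets = defaultdict(list)
        for rid, rec in enumerate(records):
            buckets[tuple(rec[i] for i in idx)].append(rid)
        tables.append((idx, buckets))
    return tables

def candidates(query, tables):
    """Union of bucket members that share at least one hash key with the query."""
    cands = set()
    for idx, buckets in tables:
        cands.update(buckets.get(tuple(query[i] for i in idx), []))
    return cands

records = [
    ("red", "small", "metal", "round"),
    ("red", "small", "wood", "round"),
    ("blue", "large", "metal", "square"),
]
tables = build_tables(records)
print(candidates(("red", "small", "metal", "square"), tables))  # likely {0, 1}
```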
5

Arunprabha, K., and V. Bhuvaneswari. "Comparing K-Value Estimation for Categorical and Numeric Data Clustring." International Journal of Computer Applications 11, no. 3 (December 10, 2010): 4–7. http://dx.doi.org/10.5120/1565-1875.

6

Chrisinta, Debora, I. Made Sumertajaya, and Indahwati Indahwati. "Evaluasi Kinerja Metode Cluster Ensemble dan Latent Class Clustering pada Peubah Campuran [Performance Evaluation of Cluster Ensemble and Latent Class Clustering Methods on Mixed Variables]." Indonesian Journal of Statistics and Its Applications 4, no. 3 (November 30, 2020): 448–61. http://dx.doi.org/10.29244/ijsa.v4i3.630.

Abstract:
Most traditional clustering algorithms are designed to focus either on numeric data or on categorical data, while data collected in the real world often contain both numeric and categorical attributes, making it difficult to apply traditional clustering algorithms directly. This paper therefore aims to identify the better method for mixed data between the cluster ensemble and latent class clustering (LCC) approaches. A cluster ensemble combines different clustering results obtained from two sub-datasets, the categorical and the numerical variables, each clustered with an algorithm designed for its data type. Latent class clustering, on the other hand, is a model-based clustering applicable to any type of data, with the number of clusters based on the estimated probability model. LCC is recommended as the best clustering method, as it provides higher accuracy and the smallest standard deviation ratio. However, LCC and the cluster ensemble method produce evaluation values that are not much different when applied to potential village data in Bengkulu Province.
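
A hedged sketch of the cluster-ensemble half of this comparison is given below: k-means on the numeric columns, k-modes on the categorical columns, and a co-association matrix to form the consensus partition. It assumes the third-party kmodes package and toy data; the latent class clustering side is not shown.

```python
# Hedged sketch of a cluster ensemble for mixed data: cluster the numeric and the
# categorical sub-datasets separately, then combine the two partitions through a
# co-association matrix. Assumes the third-party `kmodes` package.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from kmodes.kmodes import KModes

df = pd.DataFrame({
    "income": [2.1, 2.3, 8.9, 9.4, 2.0, 9.1],
    "age":    [25, 27, 52, 55, 24, 50],
    "region": ["coast", "coast", "hill", "hill", "coast", "hill"],
    "road":   ["paved", "paved", "dirt", "dirt", "paved", "dirt"],
})
k = 2

num_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(df[["income", "age"]]))
cat_labels = KModes(n_clusters=k, init="Huang", n_init=5, random_state=0).fit_predict(
    df[["region", "road"]].to_numpy())

# Co-association: fraction of base partitions in which two objects share a cluster.
parts = [num_labels, cat_labels]
n = len(df)
co = np.zeros((n, n))
for p in parts:
    co += (p[:, None] == p[None, :]).astype(float)
co /= len(parts)

# Final consensus clustering on the co-association "distance".
ensemble = AgglomerativeClustering(n_clusters=k, metric="precomputed", linkage="average")
print(ensemble.fit_predict(1.0 - co))
```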
7

Battaglia, Elena, Simone Celano, and Ruggero G. Pensa. "Differentially Private Distance Learning in Categorical Data." Data Mining and Knowledge Discovery 35, no. 5 (July 13, 2021): 2050–88. http://dx.doi.org/10.1007/s10618-021-00778-0.

Abstract:
Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable in distance-based applications, such as clustering and classification.
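
The toy Python sketch below illustrates the general idea in hedged form: a distance between two values of a target categorical attribute derived from their co-distribution with a context attribute, with Laplace noise added to the counts for differential privacy. It is not the authors' algorithm.

```python
# Illustrative sketch (not the authors' algorithm): a distance between two values of a
# target categorical attribute based on how they co-distribute with a context attribute,
# made differentially private by adding Laplace noise to the contingency counts.
import numpy as np
import pandas as pd

def dp_value_distance(df, target, context, v1, v2, epsilon=1.0, seed=None):
    rng = np.random.default_rng(seed)
    counts = pd.crosstab(df[target], df[context]).astype(float)
    # Each record contributes to exactly one cell, so a count table has sensitivity 1;
    # Laplace noise with scale 1/epsilon makes the released counts epsilon-DP.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = noisy.clip(lower=0) + 1e-9
    p1 = noisy.loc[v1] / noisy.loc[v1].sum()   # noisy context profile of value v1
    p2 = noisy.loc[v2] / noisy.loc[v2].sum()
    return 0.5 * np.abs(p1 - p2).sum()          # total variation distance between profiles

df = pd.DataFrame({
    "diagnosis": ["flu", "flu", "cold", "cold", "flu", "cold", "flu"],
    "season":    ["winter", "winter", "autumn", "spring", "winter", "autumn", "autumn"],
})
print(dp_value_distance(df, "diagnosis", "season", "flu", "cold", epsilon=0.5, seed=42))
```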
8

Ji, Jinchao, Yongbing Chen, Guozhong Feng, Xiaowei Zhao, and Fei He. "Clustering mixed numeric and categorical data with artificial bee colony strategy." Journal of Intelligent & Fuzzy Systems 36, no. 2 (March 16, 2019): 1521–30. http://dx.doi.org/10.3233/jifs-18146.

9

Ahmad, Amir, and Lipika Dey. "A k-mean clustering algorithm for mixed numeric and categorical data." Data & Knowledge Engineering 63, no. 2 (November 2007): 503–27. http://dx.doi.org/10.1016/j.datak.2007.03.016.

10

Ji, Jinchao, Ruonan Li, Wei Pang, Fei He, Guozhong Feng, and Xiaowei Zhao. "A Multi-View Clustering Algorithm for Mixed Numeric and Categorical Data." IEEE Access 9 (2021): 24913–24. http://dx.doi.org/10.1109/access.2021.3057113.

11

Ji, Jinchao, Wei Pang, Yanlin Zheng, Zhe Wang, and Zhiqiang Ma. "An Initialization Method for Clustering Mixed Numeric and Categorical Data Based on the Density and Distance." International Journal of Pattern Recognition and Artificial Intelligence 29, no. 07 (September 28, 2015): 1550024. http://dx.doi.org/10.1142/s021800141550024x.

Abstract:
Most initialization approaches are dedicated to partitional clustering algorithms which process categorical or numerical data only. However, in real-world applications, data objects with both numeric and categorical features are ubiquitous. The coexistence of categorical and numerical attributes makes the initialization methods designed for single-type data inapplicable to mixed-type data. Furthermore, to the best of our knowledge, in the existing partitional clustering algorithms designed for mixed-type data, the initial cluster centers are determined randomly. In this paper, we propose a novel initialization method for mixed data clustering. In the proposed method, distance and density are exploited together to determine the initial cluster centers. The performance of the proposed method is demonstrated by a series of experiments on three real-world datasets in comparison with that of traditional initialization methods.
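
A simplified sketch of such a density-and-distance initialization is shown below for purely numeric vectors (a mixed-data distance would be substituted in practice); the radius parameter and scoring rule are illustrative assumptions, not the paper's exact formulation.

```python
# Simplified sketch of density-and-distance initialization: pick the densest object
# first, then repeatedly pick objects that are both dense and far from chosen centers.
import numpy as np

def density_distance_init(X, k, radius=1.0):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    density = (d < radius).sum(axis=1).astype(float)            # neighbours within radius
    centers = [int(np.argmax(density))]
    while len(centers) < k:
        dist_to_centers = d[:, centers].min(axis=1)
        score = density * dist_to_centers                        # dense AND far away
        score[centers] = -np.inf
        centers.append(int(np.argmax(score)))
    return X[centers]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
print(density_distance_init(X, k=2, radius=1.0))
```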
12

Koren, Oded, Carina Antonia Hallin, Nir Perel, and Dror Bendet. "Decision-Making Enhancement in a Big Data Environment: Application of the K-Means Algorithm to Mixed Data." Journal of Artificial Intelligence and Soft Computing Research 9, no. 4 (October 1, 2019): 293–302. http://dx.doi.org/10.2478/jaiscr-2019-0010.

Abstract:
Big data research has become an important discipline in information systems research. However, the flood of data being generated on the Internet is increasingly unstructured and non-numeric, in the form of images and texts. Thus, research indicates an increasing need for more efficient algorithms for treating mixed data in big data for effective decision making. In this paper, we apply the classical K-means algorithm to both numeric and categorical attributes in big data platforms. We first present an algorithm that handles the problem of mixed data. We then use big data platforms to implement the algorithm, demonstrating its functionalities by applying it in a detailed case study. This provides a solid basis for performing more targeted profiling for decision making and research using big data. Consequently, decision makers will be able to treat mixed numerical and categorical data to explain and predict phenomena in the big data ecosystem. Our research includes a detailed end-to-end case study that presents an implementation of the suggested procedure, demonstrating its capabilities and the advantages that allow it to improve the decision-making process by targeting organizations' business requirements to specific clusters/profiles based on the enhancement outcomes.
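
The paper's big-data implementation is not reproduced here; the minimal single-machine sketch below only shows the usual preparatory step of encoding categorical attributes numerically so that classical K-means can run over a mixed table.

```python
# Minimal single-machine sketch (not the paper's big-data implementation): encode the
# categorical attributes numerically so classical k-means can run on the mixed table.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "spend":   [120.0, 95.5, 410.0, 388.2, 101.3, 405.9],
    "visits":  [3, 2, 11, 9, 4, 10],
    "channel": ["web", "web", "store", "store", "app", "store"],
})

pipeline = make_pipeline(
    ColumnTransformer([
        ("num", StandardScaler(), ["spend", "visits"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
    ]),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(df)
print(labels)   # cluster/profile id per record
```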
13

Roy, Dharmendra K., and Lokesh K. Sharma. "Genetic K-Means Clustering Algorithm for Mixed Numeric and Categorical Data Sets." International Journal of Artificial Intelligence & Applications 1, no. 2 (April 25, 2010): 23–28. http://dx.doi.org/10.5121/ijaia.2010.1203.

14

Ji, Jinchao, Wei Pang, Chunguang Zhou, Xiao Han, and Zhe Wang. "A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data." Knowledge-Based Systems 30 (June 2012): 129–35. http://dx.doi.org/10.1016/j.knosys.2012.01.006.

15

Ji, Jinchao, Tian Bai, Chunguang Zhou, Chao Ma, and Zhe Wang. "An improved k-prototypes clustering algorithm for mixed numeric and categorical data." Neurocomputing 120 (November 2013): 590–96. http://dx.doi.org/10.1016/j.neucom.2013.04.011.

16

Sen, Wu, Chen Hong, and Feng Xiaodong. "Clustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes." International Journal of Database Theory and Application 6, no. 5 (October 31, 2013): 95–104. http://dx.doi.org/10.14257/ijdta.2013.6.5.09.

17

Mathews, Lincy, and Hari Seetha. "Efficient Learning From Two-Class Categorical Imbalanced Healthcare Data." International Journal of Healthcare Information Systems and Informatics 16, no. 1 (January 2021): 81–100. http://dx.doi.org/10.4018/ijhisi.2021010105.

Abstract:
When data classes are unevenly represented across the data segments to be mined, an imbalanced two-class data challenge arises. Many health-related datasets comprising categorical data face this class imbalance challenge. This paper addresses the limitations of imbalanced two-class categorical data and presents a re-sampling solution known as 'Syn_Gen_Min' (SGM) to improve the class imbalance ratio. SGM involves finding the greedy neighbors for a given minority sample. The commonly accepted approach for a classifier is to find a numeric equivalent for categorical attributes, which results in a loss of information. The novelty of this contribution is that the categorical attributes are kept in their raw form. Five distinct categorical similarity measures are employed and tested against six real-world datasets from the healthcare sector. The application of these similarity methods leads to the generation of different synthetic samples, which significantly improves the performance measures of the classifier. This work further shows that there is no generic similarity measure that fits all datasets.
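
SGM itself is not reproduced below; the hedged sketch shows a SMOTE-like alternative in the same spirit: pick a minority record's nearest neighbours under the simple overlap similarity and synthesize a new categorical record by sampling each attribute from that neighbourhood.

```python
# Hedged, SMOTE-like sketch (not the paper's SGM procedure): synthesize a new minority
# record from a minority anchor and its nearest neighbours under overlap similarity.
import random

def overlap_similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def synthesize(minority, idx, n_neighbors=3, rng=None):
    rng = rng or random.Random(0)
    anchor = minority[idx]
    neighbours = sorted(
        (r for i, r in enumerate(minority) if i != idx),
        key=lambda r: overlap_similarity(anchor, r),
        reverse=True,
    )[:n_neighbors]
    pool = [anchor] + neighbours
    # Each attribute of the synthetic record is drawn from the neighbourhood pool.
    return tuple(rng.choice(pool)[j] for j in range(len(anchor)))

minority = [
    ("female", "smoker", "rural", "yes"),
    ("female", "smoker", "urban", "yes"),
    ("male",   "smoker", "rural", "yes"),
]
print(synthesize(minority, idx=0))
```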
18

Bathla, Gourav, Himanshu Aggarwal, and Rinkle Rani. "A Novel Approach for Clustering Big Data based on MapReduce." International Journal of Electrical and Computer Engineering (IJECE) 8, no. 3 (June 1, 2018): 1711. http://dx.doi.org/10.11591/ijece.v8i3.pp1711-1719.

Abstract:
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing, and social network analytics, and it helps users understand the similarity and dissimilarity between objects. Cluster analysis lets users understand complex and large data sets more clearly. Different types of clustering algorithms have been analyzed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data, but K-means gives good results for numerical data only. Big data is a combination of numerical and categorical data, and the K-prototype algorithm is used to deal with both: it combines the distances calculated from the numeric and categorical data. With the growth of data due to social networking websites, business transactions, scientific calculation, etc., there is a vast collection of structured, semi-structured, and unstructured data, so K-prototype needs to be optimized so that these varieties of data can be analyzed efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments have proved that K-prototype implemented on MapReduce gives better performance gain on multiple nodes compared to a single node; CPU execution time and speedup are used as evaluation metrics. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical data. Comparison with traditional algorithms proves that the proposed algorithm works better for large-scale data.
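
The MapReduce deployment is not shown here; the short sketch below only illustrates the k-prototypes mixed distance the abstract refers to, with an assumed weight gamma balancing the numeric and categorical parts.

```python
# Sketch of the k-prototypes mixed distance: squared Euclidean distance on numeric
# attributes plus a weighted count of categorical mismatches. `gamma` balances the parts.
import numpy as np

def kprototypes_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    numeric_part = float(np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part

record_num, record_cat = [35.0, 48.0], ["urban", "renter"]
proto_num, proto_cat = [33.2, 51.0], ["urban", "owner"]
print(kprototypes_distance(record_num, record_cat, proto_num, proto_cat, gamma=0.5))
```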
19

Zhang, Kang, and Xingsheng Gu. "An Affinity Propagation Clustering Algorithm for Mixed Numeric and Categorical Datasets." Mathematical Problems in Engineering 2014 (2014): 1–8. http://dx.doi.org/10.1155/2014/486075.

Abstract:
Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, numeric as well as categorical features are usually used to describe data objects. Accordingly, many clustering methods can process datasets that are either numeric or categorical, and algorithms that can handle mixed data clustering problems have recently been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method which has demonstrated good performance on a wide variety of datasets, but it has limitations in processing mixed datasets. In this paper, we propose a novel similarity measure for mixed-type datasets and an adaptive AP clustering algorithm to cluster mixed datasets. Several real-world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.
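
The paper's similarity measure is not reproduced; the hedged sketch below simply builds a Gower-style mixed-data similarity matrix and feeds it to scikit-learn's affinity propagation as a precomputed affinity.

```python
# Hedged sketch (not the paper's similarity measure): build a Gower-style similarity
# matrix for mixed data and hand it to affinity propagation as a precomputed affinity.
import numpy as np
import pandas as pd
from sklearn.cluster import AffinityPropagation

df = pd.DataFrame({
    "height": [1.62, 1.65, 1.85, 1.88, 1.60],
    "weight": [55.0, 58.0, 92.0, 95.0, 54.0],
    "sport":  ["yoga", "yoga", "rugby", "rugby", "yoga"],
})
num_cols, cat_cols = ["height", "weight"], ["sport"]

num = df[num_cols].to_numpy(dtype=float)
num = (num - num.min(axis=0)) / (num.max(axis=0) - num.min(axis=0))  # scale to [0, 1]
cat = df[cat_cols].to_numpy()

n, p = len(df), len(num_cols) + len(cat_cols)
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        num_sim = (1.0 - np.abs(num[i] - num[j])).sum()   # per-feature numeric similarity
        cat_sim = (cat[i] == cat[j]).sum()                 # categorical matches
        S[i, j] = (num_sim + cat_sim) / p                  # Gower-style similarity in [0, 1]

ap = AffinityPropagation(affinity="precomputed", random_state=0)
print(ap.fit_predict(S))
```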
20

Hsu, Chung-Chian, and Shu-Han Lin. "Visualized Analysis of Mixed Numeric and Categorical Data Via Extended Self-Organizing Map." IEEE Transactions on Neural Networks and Learning Systems 23, no. 1 (January 2012): 72–86. http://dx.doi.org/10.1109/tnnls.2011.2178323.

21

Nooraeni, Rani, Muhamad Iqbal Arsa, and Nucke Widowati Kusumo Projo. "Fuzzy Centroid and Genetic Algorithms: Solutions for Numeric and Categorical Mixed Data Clustering." Procedia Computer Science 179 (2021): 677–84. http://dx.doi.org/10.1016/j.procs.2021.01.055.

22

Jang, Hong-Jun, Byoungwook Kim, Jongwan Kim, and Soon-Young Jung. "An Efficient Grid-Based K-Prototypes Algorithm for Sustainable Decision-Making on Spatial Objects." Sustainability 10, no. 8 (July 25, 2018): 2614. http://dx.doi.org/10.3390/su10082614.

Abstract:
Data mining plays a critical role in sustainable decision-making. Although the k-prototypes algorithm is one of the best-known algorithms for clustering both numeric and categorical data, clustering a large number of spatial objects with mixed numeric and categorical attributes is still inefficient due to its complexity. In this paper, we propose an efficient grid-based k-prototypes algorithm, GK-prototypes, which achieves high performance for clustering spatial objects. The first proposed algorithm utilizes both the maximum and minimum distances between cluster centers and a cell, which can reduce unnecessary distance calculations. The second proposed algorithm, an extension of the first, utilizes spatial dependence: spatial objects that are close tend to be similar. Each cell has a bitmap index which stores the categorical values of all objects within the cell for each attribute; this bitmap index can improve performance if the categorical data is skewed. Experimental results show that the proposed algorithms achieve better performance than the existing pruning techniques of the k-prototypes algorithm.
23

Kim, Kyoungok, and Jung-sik Hong. "A hybrid decision tree algorithm for mixed numeric and categorical data in regression analysis." Pattern Recognition Letters 98 (October 2017): 39–45. http://dx.doi.org/10.1016/j.patrec.2017.08.011.

24

Lyles, Robert H., Lawrence L. Kupper, Huiman X. Barnhart, and Sandra L. Martin. "Numeric score-based conditional and overall change-in-status indices for ordered categorical data." Statistics in Medicine 34, no. 27 (July 2, 2015): 3622–36. http://dx.doi.org/10.1002/sim.6559.

25

Chen, Wei, Lei Wang, and Zi-yun Jiang. "K-prototypes based clustering algorithm for data mixed with numeric and categorical values." Journal of Computer Applications 30, no. 8 (September 2, 2010): 2003–5. http://dx.doi.org/10.3724/sp.j.1087.2010.02003.

26

Sadinle, Mauricio, and Jerome P. Reiter. "Sequentially additive nonignorable missing data modelling using auxiliary marginal information." Biometrika 106, no. 4 (October 26, 2019): 889–911. http://dx.doi.org/10.1093/biomet/asz054.

Abstract:
We study a class of missingness mechanisms, referred to as sequentially additive nonignorable, for modelling multivariate data with item nonresponse. These mechanisms explicitly allow the probability of nonresponse for each variable to depend on the value of that variable, thereby representing nonignorable missingness mechanisms. These missing data models are identified by making use of auxiliary information on marginal distributions, such as marginal probabilities for multivariate categorical variables or moments for numeric variables. We prove identification results and illustrate the use of these mechanisms in an application.
27

Lopez-Arevalo, Ivan, Edwin Aldana-Bobadilla, Alejandro Molina-Villegas, Hiram Galeana-Zapién, Victor Muñiz-Sanchez, and Saul Gausin-Valle. "A Memory-Efficient Encoding Method for Processing Mixed-Type Data on Machine Learning." Entropy 22, no. 12 (December 9, 2020): 1391. http://dx.doi.org/10.3390/e22121391.

Abstract:
The most common machine-learning methods solve supervised and unsupervised problems based on datasets whose features belong to a numerical space. However, many problems include data in which numerical and categorical data coexist, which makes them challenging to manage. To transform categorical data into a numeric form, preprocessing tasks are compulsory. Methods such as one-hot and feature-hashing have been the most widely used encoding approaches, at the expense of a significant increase in the dimensionality of the dataset. This effect introduces unexpected challenges in dealing with an overabundance of variables and/or noisy data. In this regard, we propose a novel encoding approach that maps mixed-type data into an information space using Shannon's theory to model the amount of information contained in the original data. We evaluated our proposal with ten mixed-type datasets from the UCI repository and two datasets representing real-world problems, obtaining promising results. To demonstrate its performance, the proposal was applied to prepare these datasets for classification, regression, and clustering tasks. We show that our encoding proposal is remarkably superior to one-hot and feature-hashing encoding in terms of memory efficiency while preserving the information conveyed by the original data.
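
The proposed information-space encoder is not reproduced here; the hedged comparison below only illustrates the memory argument, contrasting one-hot encoding with a single-column, information-style encoding that replaces each category by the negative log2 of its relative frequency.

```python
# Hedged illustration (not the paper's encoder): compare the memory footprint of one-hot
# encoding against a single-column, information-style encoding.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
city = pd.Series(rng.choice([f"city_{i}" for i in range(500)], size=20_000))

one_hot = pd.get_dummies(city)                       # 500 extra columns
freq = city.value_counts(normalize=True)
info = city.map(lambda v: -np.log2(freq[v]))         # 1 numeric column

print("one-hot shape:", one_hot.shape, "bytes:", one_hot.memory_usage(deep=True).sum())
print("info shape:   ", info.shape, "bytes:", info.memory_usage(deep=True))
```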
28

Dongare, Pradeep A., Sudheesh Kannan, Rakesh Garg, and S. S. Harsoor. "Describing and displaying numerical and categorical data." Airway 2, no. 2 (2019): 64. http://dx.doi.org/10.4103/arwy.arwy_24_19.

29

Pang, Guansong, Kai Ming Ting, David Albrecht, and Huidong Jin. "ZERO++: Harnessing the Power of Zero Appearances to Detect Anomalies in Large-Scale Data Sets." Journal of Artificial Intelligence Research 57 (December 29, 2016): 593–620. http://dx.doi.org/10.1613/jair.5228.

Abstract:
This paper introduces a new unsupervised anomaly detector called ZERO++, which employs the number of zero appearances in subspaces to detect anomalies in categorical data. It is unique in that it works in regions of subspaces that are not occupied by data, whereas existing methods work in regions occupied by data. ZERO++ examines only a small number of low-dimensional subspaces to successfully identify anomalies. Unlike existing frequency-based algorithms, ZERO++ does not involve subspace pattern searching. We show that ZERO++ is better than or comparable with state-of-the-art anomaly detection methods over a wide range of real-world categorical and numeric data sets, and it is efficient, with linear time complexity and constant space complexity, which makes it a suitable candidate for large-scale data sets.
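
A simplified sketch of the zero-appearance idea follows: record which value combinations occur in a few random low-dimensional subspaces of a sample, and score a record by how many of its own combinations were never observed. This is an illustration, not the published ZERO++ implementation.

```python
# Simplified sketch of the zero-appearance idea (not the published ZERO++ code).
import random

def fit_subspaces(records, n_subspaces=8, dim=2, seed=0):
    rng = random.Random(seed)
    n_attrs = len(records[0])
    model = []
    for _ in range(n_subspaces):
        idx = tuple(sorted(rng.sample(range(n_attrs), dim)))
        seen = {tuple(r[i] for i in idx) for r in records}       # observed combinations
        model.append((idx, seen))
    return model

def zero_score(record, model):
    # Count subspaces where this record's value combination was never observed.
    return sum(tuple(record[i] for i in idx) not in seen for idx, seen in model)

train = [("http", "GET", "200", "small")] * 50 + [("http", "POST", "200", "small")] * 50
model = fit_subspaces(train)
print(zero_score(("http", "GET", "200", "small"), model))   # 0: all combinations seen
print(zero_score(("ftp", "DELETE", "500", "huge"), model))  # high: many unseen combinations
```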
30

Ji, Jinchao, Chunguang Zhou, Tian Bai, Jian Zhao, and Zhe Wang. "A Novel Fuzzy K-Mean Algorithm with Fuzzy Centroid for Clustering Mixed Numeric and Categorical Data." International Journal on Advances in Information Sciences and Service Sciences 4, no. 7 (April 30, 2012): 256–64. http://dx.doi.org/10.4156/aiss.vol4.issue7.30.

31

Janeja, Vandana P., Josephine M. Namayanja, Yelena Yesha, Anuja Kench, and Vasundhara Misal. "Discovering Similarity Across Heterogeneous Features." International Journal of Data Warehousing and Mining 16, no. 4 (October 2020): 63–83. http://dx.doi.org/10.4018/ijdwm.2020100104.

Abstract:
Analyzing a heterogeneous mix of continuous and categorical attributes poses challenges in data clustering. Traditional clustering techniques like k-means clustering work well when applied to small homogeneous datasets. However, as the data size becomes large, it becomes increasingly difficult to find meaningful and well-formed clusters. In this paper, the authors propose an approach that utilizes a combined similarity function, which looks at similarity across numeric and categorical features and employs this function in a clustering algorithm to identify similarity between data objects. The findings indicate that the proposed approach handles heterogeneous data better by forming well-separated clusters.
32

Jinyin, Chen, He Huihao, Chen Jungan, Yu Shanqing, and Shi Zhaoxia. "Fast Density Clustering Algorithm for Numerical Data and Categorical Data." Mathematical Problems in Engineering 2017 (2017): 1–15. http://dx.doi.org/10.1155/2017/6393652.

Abstract:
Data objects with mixed numerical and categorical attributes are often encountered in the real world. Most existing algorithms have limitations such as low clustering quality, difficulty in determining cluster centers, and sensitivity to initial parameters. A fast density clustering algorithm (FDCA) is put forward based on a one-time scan, with cluster centers automatically determined by a center set algorithm (CSA). A novel data similarity metric is designed for clustering data that include both numerical and categorical attributes. CSA is designed to choose cluster centers from the data objects automatically, which overcomes the difficulty of setting cluster centers in most clustering algorithms. The performance of the proposed method is verified through a series of experiments on ten mixed datasets in comparison with several other clustering algorithms in terms of clustering purity, efficiency, and time complexity.
33

Dirisinapu, Lakshmi Sreenivasareddy, Krishna Murthy Mudumbi, and Govardhan Aliseri. "Outlier Analysis of Categorical Data Using Infrequency." INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 8, no. 3 (June 30, 2013): 868–73. http://dx.doi.org/10.24297/ijct.v8i3.3397.

Abstract:
Anomalies are objects that behave differently from, and do not follow, the remaining records in a database. Detecting anomalies is an important issue in many fields. Though many methods are available to detect anomalies in numerical datasets, only a few methods are available for categorical datasets. In this work, a new method is proposed that finds anomalies based on the infrequent itemsets in each record, generated using the Apriori property on the values of each record. Previous methods may not distinguish different records with the same frequency, giving the same score to each such record. Here, a score based on infrequent itemsets, called the MAD score in this paper, is generated for each record; the algorithm utilizes the frequency of each value in the dataset. The FPOF method uses the concept of frequent itemsets and Otey's method uses infrequent itemsets, but these cannot distinguish records perfectly. The proposed algorithm has been applied to the Nursery and Bank datasets taken from the UCI Machine Learning Repository, with numerical attributes excluded for this analysis. The experimental results show that it is efficient for outlier detection in categorical datasets.
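
The hedged sketch below scores records by the infrequency of their individual attribute values only, as a simplified stand-in for the infrequent-itemset (MAD) scoring described above; the threshold and toy records are assumptions.

```python
# Hedged sketch of infrequency-based outlier scoring: score each record by how rare its
# individual attribute values are (the paper works with infrequent itemsets generated
# via the Apriori property; only single-value infrequency is shown here).
from collections import Counter

def infrequency_scores(records, threshold=0.2):
    n = len(records)
    counts = [Counter(col) for col in zip(*records)]   # value frequencies per attribute
    scores = []
    for rec in records:
        score = 0.0
        for attr, value in enumerate(rec):
            support = counts[attr][value] / n
            if support < threshold:                     # infrequent value contributes
                score += 1.0 / support
        scores.append(score)
    return scores

records = [
    ("usual", "proper", "complete"),
    ("usual", "proper", "complete"),
    ("usual", "proper", "complete"),
    ("pretentious", "improper", "foster"),              # rare values -> high score
]
print(infrequency_scores(records))
```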
34

Lee, Changki, and Uk Jung. "Context-Based Geodesic Dissimilarity Measure for Clustering Categorical Data." Applied Sciences 11, no. 18 (September 10, 2021): 8416. http://dx.doi.org/10.3390/app11188416.

Abstract:
Measuring the dissimilarity between two observations is the basis of many data mining and machine learning algorithms, and its effectiveness has a significant impact on learning outcomes. The dissimilarity or distance computation has been a manageable problem for continuous data because many numerical operations can be successfully applied. However, unlike continuous data, defining a dissimilarity between pairs of observations with categorical variables is not straightforward. This study proposes a new method to measure the dissimilarity between two categorical observations, called a context-based geodesic dissimilarity measure, for the categorical data clustering problem. The proposed method considers the relationships between categorical variables and discovers the implicit topological structures in categorical data. In other words, it can effectively reflect the nonlinear patterns of arbitrarily shaped categorical data clusters. Our experimental results confirm that the proposed measure that considers both nonlinear data patterns and relationships among the categorical variables yields better clustering performance than other distance measures.
35

Indriyani, Indriyani, and M. Ihsan Alfani Putera. "Web-based Application for Classification Using Naïve Bayes and K-means Clustering (Case Study: Tic-tac-toe Game)." International Journal of Engineering and Emerging Technology 5, no. 1 (July 27, 2020): 8. http://dx.doi.org/10.24843/ijeet.2020.v05.i01.p04.

Abstract:
A database can consist of numerical and non-numerical attributes. However, several data processing algorithms, such as K-means clustering, can be used only on datasets with numerical attributes. Data processing with the Naïve Bayes and K-means clustering methods is usually carried out in the WEKA (Waikato Environment for Knowledge Analysis) application. Although the strength of WEKA lies in its increasingly complete and sophisticated algorithms, the success of data mining still lies in the knowledge of the human implementer: collecting high-quality data, modeling appropriately, and using suitable algorithms are all needed to guarantee the accuracy of the expected formulations. In this paper, we propose a simple web-based application that can be used like WEKA. The methodology used in this study includes several stages. The first stage is the preparation of the data, the tic-tac-toe game dataset, which is converted to CSV (comma-separated values) format. The next stage is the conversion of the data from non-numeric to numeric form, specifically for clustering with the K-means algorithm. Afterward, the distances between data points are calculated, followed by data clustering. The final stage is a summary of these processes and results. From the experimental results, it was found that clustering can be done on categorical attributes that are first transformed into numerical form using the web-based application.
36

Kvålseth, Tarald O. "Coefficients of Variation for Nominal and Ordinal Categorical Data." Perceptual and Motor Skills 80, no. 3 (June 1995): 843–47. http://dx.doi.org/10.2466/pms.1995.80.3.843.

Abstract:
Various measures of variation for categorical data which have been introduced suffer from the limitation that their numerical values generally appear to be unreasonable. This may cause their use to give misleading results and poor data discrimination. In this paper, two coefficients of variation for categorical data which seem to have appropriate numerical properties are introduced. One of the coefficients is appropriate for nominal data and one for ordinal data.
37

David, Gil, and Amir Averbuch. "SpectralCAT: Categorical spectral clustering of numerical and nominal data." Pattern Recognition 45, no. 1 (January 2012): 416–33. http://dx.doi.org/10.1016/j.patcog.2011.07.006.

38

Saporta, G. "Data analysis for numerical and categorical individual time-series." Applied Stochastic Models and Data Analysis 1, no. 2 (1985): 109–19. http://dx.doi.org/10.1002/asm.3150010204.

39

Dinh, Duy-Tai, Van-Nam Huynh, and Songsak Sriboonchitta. "Clustering mixed numerical and categorical data with missing values." Information Sciences 571 (September 2021): 418–42. http://dx.doi.org/10.1016/j.ins.2021.04.076.

40

Suresh Kumar, B., H. Venkateswara Reddy, S. Viswanadha Raju, and G. Vijay Kanth. "Data labeling method based on Cluster similarity using Rough Entropy for Categorical Data Clustering." International Journal of Engineering & Technology 7, no. 4.6 (September 25, 2018): 68. http://dx.doi.org/10.14419/ijet.v7i4.6.20239.

Abstract:
Data mining has become one of the growing research areas that deal with data. Clustering is recognized as an efficient methodology for data grouping, and to improve its efficiency many researchers have used data labeling methods, which assign similar data points to the proper clusters. Applying data labeling in the categorical domain is not as easy as in the numerical domain: in the numerical domain it is easy to compute the difference between two data points, but in the categorical domain it is not. Data labeling for categorical data therefore remains a challenging issue and is complex to implement; the proposed methodology addresses this problem. In the proposed method, a data sample is taken and divided into sliding windows; a standard clustering algorithm is applied to one sliding window to produce clusters, and a rough membership entropy function is used to find the similarity between unlabelled and labeled data points. The proposed methodology has two important features: 1) data points are moved into their proper clusters, yielding quality clusters, and 2) it executes with a high efficiency rate. The proposed methodology is applied to the KDD Cup99 datasets, and the results are appreciably better than those of earlier works.
41

Lee, Sung-Gi, and Deok-Kyun Yun. "Clustering Categorical and Numerical Data: A New Procedure Using Multidimensional Scaling." International Journal of Information Technology & Decision Making 02, no. 01 (March 2003): 135–59. http://dx.doi.org/10.1142/s0219622003000549.

Abstract:
In this paper, we present a concept based on the similarity of categorical attribute values considering implicit relationships and propose a new and effective clustering procedure for mixed data. Our procedure obtains similarities between categorical values from careful analysis and maps the values in each categorical attribute into points in two-dimensional coordinate space using multidimensional scaling. These mapped values make it possible to interpret the relationships between attribute values and to directly apply categorical attributes to clustering algorithms using a Euclidean distance. After trivial modifications, our procedure for clustering mixed data uses the k-means algorithm, well known for its efficiency in clustering large data sets. We use the familiar soybean disease and adult data sets to demonstrate the performance of our clustering procedure. The satisfactory results that we have obtained demonstrate the effectiveness of our algorithm in discovering structure in data.
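
A hedged sketch of the overall procedure follows: dissimilarities between categorical values are derived from their co-distribution with another attribute, mapped into 2-D with multidimensional scaling, substituted into the records, and clustered with k-means. The paper's exact similarity analysis is not reproduced, and the toy data and total-variation distance are assumptions.

```python
# Hedged sketch: map categorical values to 2-D coordinates with MDS, then run k-means
# on the resulting fully numeric records (not the paper's exact similarity analysis).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

df = pd.DataFrame({
    "crop":    ["soy", "soy", "corn", "corn", "soy", "wheat"],
    "disease": ["rot", "rot", "blight", "blight", "rot", "blight"],
    "yield":   [2.1, 2.3, 5.0, 5.2, 2.2, 4.1],
})

# Dissimilarity between crop values = total variation distance of their disease profiles.
profiles = pd.crosstab(df["crop"], df["disease"], normalize="index")
values = list(profiles.index)
D = np.array([[0.5 * np.abs(profiles.loc[a] - profiles.loc[b]).sum() for b in values]
              for a in values])

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
coord_of = {v: coords[i] for i, v in enumerate(values)}

# Substitute the 2-D coordinates for the categorical value and cluster with k-means.
X = np.column_stack([np.array([coord_of[v] for v in df["crop"]]), df[["yield"]].to_numpy()])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```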
42

Oh, Chi-Hyon, Katsuhiro Honda, and Hidetomo Ichihashi. "Quantification of Multivariate Categorical Data Considering Typicality of Item." Journal of Advanced Computational Intelligence and Intelligent Informatics 11, no. 1 (January 20, 2007): 35–39. http://dx.doi.org/10.20965/jaciii.2007.p0035.

Abstract:
We propose a method that simultaneously applies homogeneity analysis and fuzzy clustering to partition individuals and items in categorical multivariate datasets. The objective function includes two types of memberships: one is the conventional membership representing the degree of membership of each individual in each cluster, and the other is an additional parameter that represents the typicality of each item. A numerical experiment demonstrates that our proposal is useful in quantifying categorical data while taking the typicality of each item into account.
43

Dong, Bin, Songlei Jian, and Ke Zuo. "CDE++: Learning Categorical Data Embedding by Enhancing Heterogeneous Feature Value Coupling Relationships." Entropy 22, no. 4 (March 29, 2020): 391. http://dx.doi.org/10.3390/e22040391.

Abstract:
Categorical data are ubiquitous in machine learning tasks, and the representation of categorical data plays an important role in learning performance. The heterogeneous coupling relationships between features and feature values reflect the characteristics of real-world categorical data, which need to be captured in the representations. This paper proposes an enhanced categorical data embedding method, CDE++, which captures the heterogeneous feature value coupling relationships in the representations. Based on information theory and the hierarchical couplings defined in our previous work CDE (Categorical Data Embedding by learning hierarchical value coupling), CDE++ adopts mutual information and margin entropy to capture feature couplings and designs a hybrid clustering strategy to capture multiple types of feature value clusters. Moreover, an autoencoder is used to learn non-linear couplings between features and value clusters. The categorical data embeddings generated by CDE++ are low-dimensional numerical vectors which can be directly applied to clustering and classification, and they achieve the best performance compared with other categorical representation learning methods. Parameter sensitivity and scalability tests are also conducted to demonstrate the superiority of CDE++.
44

Mohanty, Ashima Sindhu, Krishna Chandra Patra, and Priyadarsan Parida. "Toddler ASD Classification Using Machine Learning Techniques." International Journal of Online and Biomedical Engineering (iJOE) 17, no. 07 (July 2, 2021): 156. http://dx.doi.org/10.3991/ijoe.v17i07.23497.

Abstract:
Autism Spectrum Disorder (ASD) has become one of the severe neurodevelopmental disorders throughout the world, and early recognition can substantially mitigate the problem. The proposed work is based on the analysis of an unbalanced ASD toddler dataset from the UCI data repository and is performed in three stages. In the first stage, the original data are preprocessed by converting the categorical attributes to numeric values through frequency encoding, followed by standardization of the numeric attributes. In the second stage, the dimensionality of the input is reduced using principal component analysis (PCA). Finally, the ASD toddler data are classified with different machine learning classification models in two settings: through the training parameter ε and through k-fold cross validation (k = 10). The experiments yield very high classification performance in comparison with other state-of-the-art approaches.
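
The hedged sketch below mirrors the preprocessing chain described above (frequency encoding, standardization, PCA, and a classifier under 10-fold cross-validation) on made-up data; the logistic regression model and all parameter choices are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch of the described preprocessing chain on made-up data (not the paper's
# exact models or parameters): frequency-encode categorical attributes, standardize,
# reduce with PCA, then evaluate a classifier with 10-fold cross-validation.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "q_score":   rng.integers(0, 10, n),
    "age_mons":  rng.integers(12, 36, n),
    "sex":       rng.choice(["m", "f"], n),
    "ethnicity": rng.choice(["a", "b", "c"], n),
})
y = (df["q_score"] + (df["sex"] == "m") * 2 + rng.normal(0, 1, n) > 6).astype(int)

# Frequency encoding: replace each category by its relative frequency in the column.
for col in ["sex", "ethnicity"]:
    df[col] = df[col].map(df[col].value_counts(normalize=True))

model = make_pipeline(StandardScaler(), PCA(n_components=3), LogisticRegression())
print(cross_val_score(model, df, y, cv=10).mean())
```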
45

Iqbal, Asif. "Modeling Milling Process Using Artificial Neural Network." Advanced Materials Research 628 (December 2012): 128–34. http://dx.doi.org/10.4028/www.scientific.net/amr.628.128.

Abstract:
Machining processes, such as milling, are considered too complex to be modeled accurately by analytical or even numeric means due to the involvement of various control parameters, some of which are highly vague and imprecise. Such a situation calls for the application of nonconventional methods for modeling the responses of interest with an acceptable degree of accuracy. In this work, a computational intelligence tool possessing quick learning ability has been used for modeling and predicting the tool's flank wear and workpiece surface roughness in the milling of cold work tool steel. Six numeric and two categorical input parameters were used in the artificial neural network model; 116 data sets were used for training the network, while 13 were used for testing. Both responses were modeled with an acceptable degree of accuracy.
46

Andreopoulos, Bill, Aijun An, and Xiaogang Wang. "Bi-level clustering of mixed categorical and numerical biomedical data." International Journal of Data Mining and Bioinformatics 1, no. 1 (2006): 19. http://dx.doi.org/10.1504/ijdmb.2006.009920.

47

Taylor, James A., and Budiman Minasny. "A protocol for converting qualitative point soil pit survey data into continuous soil property maps." Soil Research 44, no. 5 (2006): 543. http://dx.doi.org/10.1071/sr06060.

Abstract:
Vineyard soil surveys to date have focused on presenting soil data in point rather than raster format. This is due to the recording of both numeric and categorical variables. A protocol, including a lookup table to transform linguistic texture values into particle size distributions, to convert point data into continuous raster maps is presented. The resulting maps are coherent with vineyard knowledge and provide a strong spatial representation of soil variability within the vineyard. Validation with an independent dataset shows an error of ~10% in prediction; however, some of this can be attributed to errors in the geo-rectification of old data. Raster maps allow the survey data to be incorporated into computer systems to better model vineyard and irrigation designs and are more readily used in day-to-day vineyard management decisions.
48

Wangchamhan, Tanachapong, Sirapat Chiewchanwattana, and Khamron Sunat. "Efficient algorithms based on the k-means and Chaotic League Championship Algorithm for numeric, categorical, and mixed-type data clustering." Expert Systems with Applications 90 (December 2017): 146–67. http://dx.doi.org/10.1016/j.eswa.2017.08.004.

49

Kim, Juhyun, Yiwen Zhang, Joshua Day, and Hua Zhou. "MGLM: An R Package for Multivariate Categorical Data Analysis." R Journal 10, no. 1 (2018): 73. http://dx.doi.org/10.32614/rj-2018-015.

50

Vahldiek, Kai, Libing Zhou, Wenfeng Zhu, and Frank Klawonn. "Development of a data generator for multivariate numerical data with arbitrary correlations and distributions." Intelligent Data Analysis 25, no. 4 (July 9, 2021): 789–807. http://dx.doi.org/10.3233/ida-205253.

Abstract:
Artificial or simulated data are particularly relevant in tests and benchmarks for machine learning methods, in teaching for exercises and for setting up analysis workflows. They are relevant when real data may not be used for reasons of data protection, or when special distributions or effects should be present in the data to test certain machine learning methods. In this paper a generator for multivariate numerical data with arbitrary marginal distributions and – as far as possible – arbitrary correlations is presented. The data generator is implemented in the open source statistics software R. It can also be used for categorical variables, if data are generated separately for the corresponding characteristics of a categorical variable. Additionally, outliers can be integrated. The use of the data generator is demonstrated with a concrete example.
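
The paper's generator is implemented in R and is not reproduced here; the hedged Python sketch below shows the common NORTA-style recipe for the same goal: draw correlated Gaussian columns with a target correlation matrix, then push each column through the inverse CDF of the desired marginal.

```python
# Hedged Python sketch (the paper's R generator is not reproduced): a NORTA-style
# recipe for generating multivariate data with chosen marginals and correlations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
target_corr = np.array([[1.0, 0.7, -0.3],
                        [0.7, 1.0,  0.2],
                        [-0.3, 0.2, 1.0]])

z = rng.multivariate_normal(mean=np.zeros(3), cov=target_corr, size=5_000)
u = stats.norm.cdf(z)                                  # uniform marginals, dependence kept

data = np.column_stack([
    stats.expon(scale=2.0).ppf(u[:, 0]),               # exponential marginal
    stats.gamma(a=3.0).ppf(u[:, 1]),                   # gamma marginal
    stats.norm(loc=10.0, scale=4.0).ppf(u[:, 2]),      # normal marginal
])
print(np.round(np.corrcoef(data, rowvar=False), 2))    # close to, not exactly, target_corr
```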