
Theses on the topic "Feature selection"


Below are the top 50 dissertations (degree or doctoral theses) for research on the topic "Feature selection".


1

Zheng, Ling. "Feature grouping-based feature selection". Thesis, Aberystwyth University, 2017. http://hdl.handle.net/2160/41e7b226-d8e1-481f-9c48-4983f64b0a92.

Full text
Abstract:
Feature selection (FS) is a process which aims to select input domain features that are most informative for a given outcome. Unlike other dimensionality reduction techniques, feature selection methods preserve the underlying semantics or meaning of the original data following reduction. Typically, FS can be divided into four categories: filter, wrapper, hybrid-based and embedded approaches. Many strategies have been proposed for this task in an effort to identify more compact and better quality feature subsets. As various advanced techniques have emerged in the development of search mechanisms, it has become increasingly possible for quality feature subsets to be discovered efficiently without resorting to exhaustive search. Harmony search is a music-inspired stochastic search method. This general technique can be used to support FS in conjunction with many available feature subset quality evaluation methods. The structural simplicity of this technique means that it is capable of reducing the overall complexity of the subset search. The naturally stochastic properties of this technique also help to reduce local optima for any resultant feature subset, whilst locating multiple, potential candidates for the final subset. However, it is not sufficiently flexible in adjusting the size of the parametric musician population, which directly affects the performance on feature subset size reduction. This weakness can be alleviated to a certain extent by an iterative refinement extension, but the fundamental issue remains. Stochastic mechanisms have not been explored to their maximum potential by the original work, as it does not employ a parameter of pitch adjustment rate due to its ineffective mapping of concepts. To address the above problems, this thesis proposes a series of extensions. Firstly, a self-adjusting approach is proposed for the task of FS which involves a mechanism to further improve the performance of the existing harmony search-based method. 
This approach introduces three novel techniques: a restricted feature domain created for each individual musician contributing to the harmony improvisation in order to improve harmony diversity; a harmony memory consolidation which explores the possibility of exchanging/communicating information amongst musicians such that it can dynamically adjust the population of musicians in improvising new harmonies; and a pitch adjustment which exploits feature similarity measures to identify neighbouring features in order to fine-tune the newly discovered harmonies. These novel developments are also supplemented by a further new proposal involving the application to a feature grouping-based approach proposed herein for FS, which works by searching for feature subsets across homogeneous feature groups rather than examining a massive number of possible combinations of features. This approach radically departs from the traditional FS techniques that work by incrementally adding/removing features from a candidate feature subset one feature at a time, or by randomly selecting feature combinations without considering the relationship(s) between features. As such, information such as inter-feature correlation may be retained and the residual redundancy in the returned feature subset minimised. Two different instantiations of an FS mechanism are derived from such a feature grouping-based framework: one based upon the straightforward ranking of features within the resultant feature grouping; and the other based upon a simplified harmony search-based FS. Feature grouping-based FS offers a self-adjusting approach to effectively and efficiently addressing many real-world problems which may have data dimensionality concerns and which require semantics preservation in data reduction. This thesis investigates the application of this approach in the area of intrusion detection, which must deal in a timely fashion with huge quantities of data extracted from network traffic or audit trails.
This approach empirically demonstrates the efficacy of feature grouping-based FS in action.
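The stochastic subset search that harmony search performs can be sketched in a few lines. This is a minimal illustration only, not Zheng's extended algorithm: the function name, the parameters (harmony memory consideration rate `hmcr`, memory size) and the toy objective are invented for the example.

```python
import random

def harmony_search_fs(evaluate, n_features, memory_size=10, hmcr=0.9,
                      iterations=200, seed=0):
    """Basic harmony search over binary feature masks (illustrative sketch).

    evaluate: callable mapping a 0/1 feature mask (tuple) to a score to
    maximise, e.g. a subset quality measure from a filter or wrapper.
    """
    rng = random.Random(seed)
    # Initialise the harmony memory with random feature subsets.
    memory = [tuple(rng.randint(0, 1) for _ in range(n_features))
              for _ in range(memory_size)]
    scores = [evaluate(h) for h in memory]
    for _ in range(iterations):
        # Improvise a new harmony feature by feature: with probability hmcr
        # take the value a remembered harmony uses, otherwise choose at random.
        new = tuple(
            rng.choice(memory)[j] if rng.random() < hmcr
            else rng.randint(0, 1)
            for j in range(n_features)
        )
        new_score = evaluate(new)
        # Replace the worst remembered harmony if the new one is better.
        worst = min(range(memory_size), key=lambda i: scores[i])
        if new_score > scores[worst]:
            memory[worst], scores[worst] = new, new_score
    best = max(range(memory_size), key=lambda i: scores[i])
    return memory[best], scores[best]

# Toy objective: features 0 and 2 are "informative"; every selected
# feature carries a small cost, so the optimum is the mask (1,0,1,0,0,0).
def toy_eval(mask):
    return 2.0 * mask[0] + 2.0 * mask[2] - 0.5 * sum(mask)

best_mask, best_score = harmony_search_fs(toy_eval, n_features=6)
```

The memory-consideration step is what keeps the search cheap: new candidates are mostly recombinations of subsets already known to score well.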
2

Dreyer, Sigve. "Evolutionary Feature Selection". Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-24225.

Full text
Abstract:
This thesis contains research on feature selection, in particular feature selection using evolutionary algorithms. Feature selection is motivated by increasing data dimensionality and the need to construct simple induction models. A literature review of evolutionary feature selection is conducted. After that, an abstract feature selection algorithm, capable of using many different wrappers, is constructed. The algorithm is configured using a low-dimensional dataset. Finally, it is tested on a wide range of datasets, revealing both its abilities and its problems. The main contribution is the revelation that classifier accuracy is not a sufficient metric for feature selection on high-dimensional data.
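Evolutionary feature selection of the kind studied here can be sketched as a small genetic algorithm over binary feature masks. A minimal sketch, not the thesis's configured algorithm: the toy fitness below stands in for a wrapper's cross-validated accuracy, and all names and parameters are invented for the example.

```python
import random

def ga_feature_selection(fitness, n_features, pop_size=20, generations=40,
                         mutation_rate=0.05, seed=1):
    """Toy genetic algorithm over binary feature masks (illustrative only)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        for _ in range(pop_size):
            # Tournament selection of two parents (best of 3 random picks).
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Uniform crossover, then bit-flip mutation.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fitness)
    return best, fitness(best)

# Toy fitness: features 1 and 3 are informative, each selected feature
# costs 1, so the optimum keeps exactly those two (fitness 4).
def toy_fitness(mask):
    return 3 * mask[1] + 3 * mask[3] - sum(mask)

best, score = ga_feature_selection(toy_fitness, n_features=8)
```

In a real wrapper setting the fitness call is the expensive part (training and evaluating a classifier per candidate mask), which is exactly the cost the thesis's experiments expose on high-dimensional data.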
3

Doquet, Guillaume. "Agnostic Feature Selection". Electronic Thesis or Diss., Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS486.

Full text
Abstract:
With the advent of Big Data, databases whose size far exceeds the human scale are becoming increasingly common. The resulting overabundance of monitored variables (friends on a social network, movies watched, nucleotides coding the DNA, monetary transactions...) has motivated the development of Dimensionality Reduction (DR) techniques. A DR algorithm such as Principal Component Analysis (PCA) or an autoencoder typically combines the original variables into new features, fewer in number, such that most of the information in the dataset is conveyed by the extracted feature set. A particular subcategory of DR is formed by Feature Selection (FS) methods, which directly retain the most important initial variables. How to select the best candidates is a hot topic at the crossroads of statistics and Machine Learning. Feature importance is usually inferred in a supervised context, where variables are ranked according to their usefulness for predicting a specific target feature. The present thesis focuses on the unsupervised context in FS, i.e. the challenging situation where no prediction goal is available to help assess feature relevance. Instead, unsupervised FS algorithms usually build an artificial classification goal and rank features based on their helpfulness for predicting this new target, thus falling back on the supervised context. Additionally, the efficiency of unsupervised FS approaches is typically also assessed in a supervised setting. In this work, we propose an alternative model combining unsupervised FS with data compression. Our Agnostic Feature Selection (AgnoS) algorithm does not rely on creating an artificial target and aims to retain a feature subset sufficient to recover the whole original dataset, rather than a specific variable. As a result, AgnoS does not suffer from the selection bias inherent to clustering-based techniques. The second contribution of this work (Agnostic Feature Selection, G. Doquet & M. Sebag, ECML PKDD 2019) is to establish both the brittleness of the standard supervised evaluation of unsupervised FS, and the stability of the newly proposed AgnoS.
4

Sima, Chao. "Small sample feature selection". Texas A&M University, 2003. http://hdl.handle.net/1969.1/5796.

Full text
Abstract:
High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection is required to avoid overfitting, and the combinatorial nature of the problem demands a suboptimal feature-selection algorithm. In this dissertation, we show via three different approaches that feature selection is problematic in small-sample settings. First, we examined the feature-ranking performance of several kinds of error estimators for different classification rules, by considering all feature subsets and using two measures of performance. The results show that their ranking is strongly affected by inaccurate error estimation. Secondly, since enumerating all feature subsets is computationally impossible in practice, a suboptimal feature-selection algorithm is often employed to find, from a large set of potential features, a small subset with which to classify the samples. If error estimation is required for a feature-selection algorithm, then the impact of error estimation can be greater than the choice of algorithm. Lastly, we took a regression approach by comparing the classification errors for the optimal feature sets and the errors for the feature sets found by feature-selection algorithms. Our study shows that it is unlikely that feature selection will yield a feature set whose error is close to that of the optimal feature set, and the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist.
5

Coelho, Frederico Gualberto Ferreira. "Semi-supervised feature selection". Universidade Federal de Minas Gerais, 2013. http://hdl.handle.net/1843/BUOS-97NJ9S.

Full text
Abstract:
As data acquisition has become relatively easy and inexpensive, data sets are becoming extremely large, both in the number of variables and in the number of instances. However, the same is not true for labeled instances: the cost of obtaining labels is usually very high, so unlabeled data represent the majority of instances, especially when compared with the amount of labeled data. Using such data requires special care, since several problems arise with increasing dimensionality and the lack of labels. Reducing the dimensionality of the data is thus a primary need. Among the features, irrelevant and redundant variables are usually found, which can and should be eliminated. When identifying these variables, discarding the unlabeled data and implementing only supervised strategies loses structural information that could be useful. Likewise, ignoring the labeled data by implementing only unsupervised methods also loses information. In this context, a semi-supervised approach is very suitable, as one can try to exploit the best of what each type of data has to offer. We work on the problem of semi-supervised feature selection via two different approaches, which may eventually complement each other. The problem can be addressed in the context of feature clustering, grouping similar variables and discarding the irrelevant ones. On the other hand, we address the problem through a multi-objective approach, since we have arguments that clearly establish its multi-objective nature. In the first approach, a similarity measure based on mutual information and able to take into account both labeled and unlabeled data is developed, as well as a criterion, based on this measure, for clustering and discarding variables. The principle of homogeneity between labels and data clusters is also exploited, and two semi-supervised feature selection methods are developed. Finally, a mutual information estimator for a mixed set of discrete and continuous variables is developed as a secondary contribution. In the multi-objective approach, the proposal is to solve the problems of feature selection and function approximation at the same time. The proposed method considers a different weight-vector norm for each layer of a Multi-Layer Perceptron (MLP) neural network, trains each layer independently, and defines objective functions able to eliminate irrelevant features.
6

Garnes, Øystein Løhre. "Feature Selection for Text Categorisation". Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9017.

Full text
Abstract:

Text categorization is the task of discovering the category or class text documents belong to, or in other words spotting the correct topic for text documents. While many machine learning schemes exist today for building automatic classifiers, these are typically resource-demanding and do not always achieve the best results when given the whole contents of the documents. A popular solution to these problems is feature selection. The features (e.g. terms) in a document collection are given weights based on a simple scheme and then ranked by these weights. Next, each document is represented using only the top-ranked features, typically only a few percent of the features. The classifier is then built in considerably less time, and accuracy may even improve. In situations where documents can belong to one of a series of categories, one can either build a multi-class classifier and use one feature set for all categories, or split the problem into a series of binary categorization tasks (deciding whether documents belong to a category or not) and create one ranked feature subset for each category/classifier. Many feature selection metrics have been suggested over the last decades, including supervised methods that make use of a manually pre-categorized set of training documents, and unsupervised methods that need only training documents of the same type or collection as that to be categorized. While many of these look promising, there has been a lack of large-scale comparison experiments; moreover, several methods have been proposed in the last two years. Most evaluations are conducted on a set of binary tasks instead of a multi-class task, as this often gives better results, although multi-class categorization with a joint feature set is often used in operational environments. In this report, we present results from the comparison of 16 feature selection methods (in addition to random selection) using various feature set sizes. Of these, five were unsupervised and 11 were supervised. All methods are tested on both a Naive Bayes (NB) classifier and a Support Vector Machine (SVM) classifier. We conducted multi-class experiments using a collection with 20 non-overlapping categories, and each feature selection method produced feature sets common to all the categories. We also combined feature selection methods and evaluated their joint efforts. We found that the classical supervised methods had the best performance, including Chi Square, Information Gain and Mutual Information. The Chi Square variant GSS coefficient was also among the top performers. Odds Ratio showed excellent performance for NB, but not for SVM. The three unsupervised methods Collection Frequency, Collection Frequency Inverse Document Frequency and Term Frequency Document Frequency all showed performance close to the best group. The Bi-Normal Separation metric produced excellent results for the smallest feature subsets. The weirdness factor performed several times better than random selection, but was not among the top-performing group. Some combination experiments achieved better results than each method alone, but the majority did not. The top performers Chi Square and GSS coefficient classified more documents when used together than alone. Four of the five combinations that showed an increase in performance included the BNS metric.
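The chi-square term scoring that performs well in such comparisons is straightforward to compute from a 2x2 term/class contingency table. A minimal sketch with an invented toy corpus (document sets and labels are made up for the example):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term/class 2x2 contingency table.

    n11: in-class docs containing the term, n10: out-of-class docs with it,
    n01: in-class docs without it,        n00: out-of-class docs without it.
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

def rank_terms(docs, labels, target):
    """Rank vocabulary terms by chi-square against one target class."""
    vocab = sorted({t for d in docs for t in d})
    scores = {}
    for t in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if t in d and y == target)
        n10 = sum(1 for d, y in zip(docs, labels) if t in d and y != target)
        n01 = sum(1 for d, y in zip(docs, labels) if t not in d and y == target)
        n00 = sum(1 for d, y in zip(docs, labels) if t not in d and y != target)
        scores[t] = chi_square(n11, n10, n01, n00)
    return sorted(vocab, key=lambda t: -scores[t]), scores

# Toy corpus: each document is a set of terms.
docs = [{"ball", "goal"}, {"ball", "match"}, {"vote", "law"}, {"vote", "tax"}]
labels = ["sport", "sport", "politics", "politics"]
ranked, scores = rank_terms(docs, labels, "sport")
```

Only the top-ranked terms are then kept as the document representation; note that chi-square is symmetric, so a term perfectly anti-correlated with the class scores as highly as one perfectly correlated with it.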

7

Pradhananga, Nripendra. "Effective Linear-Time Feature Selection". The University of Waikato, 2007. http://hdl.handle.net/10289/2315.

Full text
Abstract:
The classification learning task requires selection of a subset of features to represent the patterns to be classified. This is because both the performance of the classifier and the cost of classification are sensitive to the choice of features used to construct it. Exhaustive search is impractical, since it examines every possible combination of features. The runtime of heuristic and random searches is better, but the problem still persists when dealing with high-dimensional datasets. We investigate a heuristic, forward, wrapper-based approach, called Linear Sequential Selection, which limits the search space at each iteration of the feature selection process. We also introduce randomization into the search space; the resulting algorithm is called Randomized Linear Sequential Selection. Our experiments demonstrate that both methods are faster, find smaller subsets and can even increase classification accuracy. We also explore the idea of ensemble learning. We propose two ensemble creation methods, Feature Selection Ensemble and Random Feature Ensemble, both of which apply a feature selection algorithm to create the individual classifiers of the ensemble. Our experiments have shown that both methods work well with high-dimensional data.
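A plain forward wrapper search of the kind this thesis builds on can be sketched as follows. This is an illustrative sketch, not Linear Sequential Selection itself: the toy scoring function stands in for a wrapped classifier's validation accuracy.

```python
def forward_selection(features, evaluate, max_features=None):
    """Greedy forward wrapper: repeatedly add the feature that most
    improves the subset score; stop when no candidate improves it."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try each remaining feature appended to the current subset.
        cand = max(remaining, key=lambda f: evaluate(selected + [f]))
        score = evaluate(selected + [cand])
        if score <= best_score:  # no improvement: stop searching
            break
        selected.append(cand)
        remaining.remove(cand)
        best_score = score
    return selected, best_score

# Toy score: "a" and "b" are informative, every feature costs 0.5,
# so the search should stop after selecting exactly ["a", "b"].
def toy_score(subset):
    informative = {"a", "b"}
    return 2 * len(informative & set(subset)) - 0.5 * len(subset)

sel, sel_score = forward_selection(["a", "b", "c"], toy_score)
```

Each iteration costs one classifier evaluation per remaining feature, which is why limiting the per-iteration search space, as the thesis's methods do, matters on high-dimensional data.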
8

Cheng, Iunniang. "Hybrid Methods for Feature Selection". TopSCHOLAR®, 2013. http://digitalcommons.wku.edu/theses/1244.

Full text
Abstract:
Feature selection is one of the important data preprocessing steps in data mining. The feature selection problem involves finding a feature subset such that a classification model built with only this subset has better predictive accuracy than a model built with the complete set of features. In this study, we propose two hybrid methods for feature selection. The best features are selected through either the hybrid methods or existing feature selection methods. Next, the reduced dataset is used to build classification models using five classifiers. Classification accuracy was evaluated in terms of the area under the Receiver Operating Characteristic curve (AUC) performance metric. The proposed methods have been shown empirically to improve the performance of existing feature selection methods.
9

Athanasakis, D. "Feature selection in computational biology". Thesis, University College London (University of London), 2014. http://discovery.ucl.ac.uk/1432346/.

Full text
Abstract:
This thesis concerns feature selection, with a particular emphasis on the computational biology domain and the possibility of non-linear interaction between features. Towards this it establishes a two-step approach, where the first step is feature selection, followed by the learning of a kernel machine in this reduced representation. Optimization of kernel target alignment is proposed as a model selection criterion and its properties are established for a number of feature selection algorithms, including some novel variants of stability selection. The thesis further studies greedy and stochastic approaches for optimizing alignment, proposing a fast stochastic method with substantial probabilistic guarantees. The proposed stochastic method compares favorably to its deterministic counterparts in terms of computational complexity and resulting accuracy. The characteristics of this stochastic proposal in terms of computational complexity and applicability to multi-class problems make it invaluable to a deep learning architecture which we propose. Very encouraging results of this architecture on a recent challenge dataset further justify this approach, with good further results on a signal peptide cleavage prediction task. These proposals are evaluated in terms of generalization accuracy, interpretability and numerical stability of the models, and speed, on a number of real datasets arising from infectious disease bioinformatics, with encouraging results.
10

Sarkar, Saurabh. "Feature Selection with Missing Data". University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1378194989.

Full text
11

Pocock, Adam Craig. "Feature selection via joint likelihood". Thesis, University of Manchester, 2012. https://www.research.manchester.ac.uk/portal/en/theses/feature-selection-via-joint-likelihood(3baba883-1fac-4658-bab0-164b54c3784a).html.

Full text
Abstract:
We study the nature of filter methods for feature selection. In particular, we examine information theoretic approaches to this problem, looking at the literature over the past 20 years. We consider this literature from a different perspective, by viewing feature selection as a process which minimises a loss function. We choose to use the model likelihood as the loss function, and thus we seek to maximise the likelihood. The first contribution of this thesis is to show that the problem of information theoretic filter feature selection can be rephrased as maximising the likelihood of a discriminative model. From this novel result we can unify the literature revealing that many of these selection criteria are approximate maximisers of the joint likelihood. Many of these heuristic criteria were hand-designed to optimise various definitions of feature "relevancy" and "redundancy", but with our probabilistic interpretation we naturally include these concepts, plus the "conditional redundancy", which is a measure of positive interactions between features. This perspective allows us to derive the different criteria from the joint likelihood by making different independence assumptions on the underlying probability distributions. We provide an empirical study which reinforces our theoretical conclusions, whilst revealing implementation considerations due to the varying magnitudes of the relevancy and redundancy terms. We then investigate the benefits our probabilistic perspective provides for the application of these feature selection criteria in new areas. The joint likelihood automatically includes a prior distribution over the selected feature sets and so we investigate how including prior knowledge affects the feature selection process. We can now incorporate domain knowledge into feature selection, allowing the imposition of sparsity on the selected feature set without using heuristic stopping criteria. 
We investigate the use of priors mainly in the context of Markov Blanket discovery algorithms, in the process showing that a family of algorithms based upon IAMB are iterative maximisers of our joint likelihood with respect to a particular sparsity prior. We thus extend the IAMB family to include a prior for domain knowledge in addition to the sparsity prior. Next we investigate what the choice of likelihood function implies about the resulting filter criterion. We do this by applying our derivation to a cost-weighted likelihood, showing that this likelihood implies a particular cost-sensitive filter criterion. This criterion is based on a weighted branch of information theory and we prove several novel results justifying its use as a feature selection criterion, namely the positivity of the measure, and the chain rule of mutual information. We show that the feature set produced by this cost-sensitive filter criterion can be used to convert a cost-insensitive classifier into a cost-sensitive one by adjusting the features the classifier sees. This can be seen as an analogous process to that of adjusting the data via over- or undersampling to create a cost-sensitive classifier, but with the crucial difference that it does not artificially alter the data distribution. Finally we conclude with a summary of the benefits this loss function view of feature selection has provided. This perspective can be used to analyse feature selection techniques other than those based upon information theory, and new groups of selection criteria can be derived by considering novel loss functions.
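A relevancy-minus-redundancy criterion of the family this thesis unifies can be sketched with a plug-in mutual information estimate on discrete data. This is an illustrative sketch of one such heuristic (an mRMR-style greedy selection), not the thesis's joint-likelihood derivation; the data and column names are invented for the example.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete
    variables given as equal-length sequences of values."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mrmr(data, target, k):
    """Greedily pick k columns of `data` (a dict name -> value list)
    maximising relevancy to `target` minus mean redundancy with the
    already-selected features."""
    features = list(data)
    selected = []
    while len(selected) < k and len(selected) < len(features):
        def criterion(f):
            rel = mutual_information(data[f], target)
            red = (sum(mutual_information(data[f], data[s]) for s in selected)
                   / len(selected)) if selected else 0.0
            return rel - red
        best = max((f for f in features if f not in selected), key=criterion)
        selected.append(best)
    return selected

# Toy data: x2 is an exact copy of x1 (redundant), x3 is independent of
# x1 but informative, and the target depends on both x1 and x3.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
y = [ai + 2 * bi for ai, bi in zip(a, b)]
data = {"x1": a, "x2": list(a), "x3": b}
```

With these values the redundancy term steers the second pick away from the copy `x2` and towards `x3`, which is exactly the "relevancy vs. redundancy" trade-off the criteria surveyed in this thesis approximate.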
12

Baker, Antoin Lenard. "Computer aided invariant feature selection". [Gainesville, Fla.] : University of Florida, 2008. http://purl.fcla.edu/fcla/etd/UFE0022870.

Full text
13

Bäck, Eneroth Moa. "A Feature Selection Approach for Evaluating and Selecting Performance Metrics". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-280817.

Full text
Abstract:
To accurately define and measure performance is a complex process for most businesses, yet crucial for optimal distribution of company resources and for accomplishing alignment across business units. Despite the large amount of data available to most modern companies today, performance metrics are commonly selected based on expertise, tradition, or even gut feeling. In this thesis, a data-driven approach is proposed in the form of a statistical framework for evaluating and selecting performance metrics. The outline of the framework is influenced by the method of time series feature selection, and wraps the search for relevant features around a time series forecasting model. The framework is tuned by experiments exploring state-of-the-art forecasting models in combination with two different feature selection methods. The results demonstrate that, for metrics similar to the real-world data used in this thesis, the best framework incorporates the filter feature selection method in combination with a univariate time series forecasting model.
Gli stili APA, Harvard, Vancouver, ISO e altri
14

Nogueira, Sarah. "Quantifying the stability of feature selection". Thesis, University of Manchester, 2018. https://www.research.manchester.ac.uk/portal/en/theses/quantifying-the-stability-of-feature-selection(6b69098a-58ee-4182-9a30-693d714f0c9f).html.

Testo completo
Abstract (sommario):
Feature Selection is central to modern data science, from exploratory data analysis to predictive model-building. The "stability" of a feature selection algorithm refers to the robustness of its feature preferences, with respect to data sampling and to its stochastic nature. An algorithm is "unstable" if a small change in data leads to large changes in the chosen feature subset. Whilst the idea is simple, quantifying this has proven more challenging: we note numerous proposals in the literature, each with different motivation and justification. We present a rigorous statistical and axiomatic treatment of this issue. In particular, with this work we consolidate the literature and provide (1) a deeper understanding of existing work based on a small set of properties, and (2) a clearly justified statistical approach with several novel benefits. This approach serves to identify a stability measure obeying all desirable properties, and (for the first time in the literature) allowing confidence intervals and hypothesis tests on the stability of an approach, enabling rigorous comparison of feature selection algorithms.
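The notion of stability described above can be illustrated with a small sketch. One simple (and, unlike the thesis's measure, not bias-corrected) way to quantify it is the average pairwise Jaccard similarity between the subsets an algorithm selects on different data samples; the function names and toy subsets below are illustrative assumptions, not the thesis's actual estimator.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(subsets):
    """Mean pairwise Jaccard similarity over subsets selected on different samples."""
    pairs = [(i, j) for i in range(len(subsets)) for j in range(i + 1, len(subsets))]
    return sum(jaccard(subsets[i], subsets[j]) for i, j in pairs) / len(pairs)

# Identical choices on every sample: perfectly stable.
print(stability([{0, 1, 2}, {0, 1, 2}, {0, 1, 2}]))  # 1.0
# Disjoint choices on every sample: maximally unstable.
print(stability([{0, 1}, {2, 3}, {4, 5}]))  # 0.0
```

A measure of this kind is exactly what the thesis subjects to statistical treatment, asking which properties such an index should satisfy and how to attach confidence intervals to it.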
Gli stili APA, Harvard, Vancouver, ISO e altri
15

Li, Jiexun. "Feature Construction, Selection And Consolidation For Knowledge Discovery". Diss., The University of Arizona, 2007. http://hdl.handle.net/10150/193819.

Testo completo
Abstract (sommario):
With the rapid advance of information technologies, human beings increasingly rely on computers to accumulate, process, and make use of data. Knowledge discovery techniques have been proposed to automatically search large volumes of data for patterns. Knowledge discovery often requires a set of relevant features to represent the specific domain. My dissertation presents a framework of feature engineering for knowledge discovery, including feature construction, feature selection, and feature consolidation. Five essays in my dissertation present novel approaches to construct, select, or consolidate features in various applications. Feature construction is used to derive new features when relevant features are unknown. Chapter 2 focuses on constructing informative features from a relational database. I introduce a probabilistic relational model-based approach to construct personal and social features for identity matching. Experiments on a criminal dataset showed that social features can improve the matching performance. Chapter 3 focuses on identifying good features for knowledge discovery from text. Four types of writeprint features are constructed and shown effective for authorship analysis of online messages. Feature selection is aimed at identifying a subset of significant features from a high-dimensional feature space. Chapter 4 presents a framework of feature selection techniques. This essay focuses on identifying marker genes for microarray-based cancer classification. Our experiments on gene array datasets showed excellent performance for optimal search-based gene subset selection. Feature consolidation is aimed at integrating features from diverse data sources or in heterogeneous representations. Chapter 5 presents a Bayesian framework to integrate gene functional relations extracted from heterogeneous data sources such as gene expression profiles, biological literature, and genome sequences.
Chapter 6 focuses on kernel-based methods to capture and consolidate information in heterogeneous data representations. I design and compare different kernels for relation extraction from biomedical literature. Experiments show good performance of tree kernels and composite kernels for biomedical relation extraction. These five essays together compose a framework of feature engineering and present different techniques to construct, select, and consolidate relevant features. This feature engineering framework contributes to the domain of information systems by improving the effectiveness, efficiency, and interpretability of knowledge discovery.
Gli stili APA, Harvard, Vancouver, ISO e altri
16

Hare, Brian K. Dinakarpandian Deendayal. "Feature selection in DNA microarray analysis". Diss., UMK access, 2004.

Cerca il testo completo
Abstract (sommario):
Thesis (M.S.)--School of Computing and Engineering. University of Missouri--Kansas City, 2004.
"A thesis in computer science." Typescript. Advisor: D. Dinakarpandian. Vita. Title from "catalog record" of the print edition. Description based on contents viewed Feb. 24, 2006. Includes bibliographical references (leaves 81-86). Online version of the print edition.
Gli stili APA, Harvard, Vancouver, ISO e altri
17

Youn, Eun Seog. "Feature selection in support vector machines". [Gainesville, Fla.] : University of Florida, 2002. http://purl.fcla.edu/fcla/etd/UFE1000171.

Testo completo
Abstract (sommario):
Thesis (M.S.)--University of Florida, 2002.
Title from title page of source document. Document formatted into pages; contains x, 50 p.; also contains graphics. Includes vita. Includes bibliographical references.
Gli stili APA, Harvard, Vancouver, ISO e altri
18

Fusting, Christopher Winter. "Temporal Feature Selection with Symbolic Regression". ScholarWorks @ UVM, 2017. http://scholarworks.uvm.edu/graddis/806.

Testo completo
Abstract (sommario):
Building and discovering useful features when constructing machine learning models is the central task for the machine learning practitioner. Good features are useful not only in increasing the predictive power of a model but also in illuminating the underlying drivers of a target variable. In this research we propose a novel feature learning technique in which symbolic regression is endowed with a "Range Terminal" that allows it to explore functions of the aggregate of variables over time. We test the Range Terminal on a synthetic data set and a real-world data set in which we predict seasonal greenness using satellite-derived temperature and snow data over a portion of the Arctic. On the synthetic data set we find symbolic regression with the Range Terminal outperforms standard symbolic regression and Lasso regression. On the Arctic data set we find it outperforms standard symbolic regression, fails to beat the Lasso regression, but finds useful features describing the interaction between Land Surface Temperature, snow, and seasonal vegetative growth in the Arctic.
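As a rough illustration of the idea (the exact genetic-programming primitive in the thesis may differ), a Range-Terminal-style feature aggregates a variable over a window of time steps; the window bounds and aggregation functions below are assumptions made for the example.

```python
# Illustrative sketch of a range-terminal feature: aggregate a time series
# over an index window. The aggregation function is a parameter, so a
# search procedure can explore sums, means, etc. of past values.

def range_terminal(series, start, end, agg=sum):
    """Apply an aggregate to series[start:end] - a terminal over a time range."""
    return agg(series[start:end])

temps = [1.0, 2.0, 3.0, 4.0, 5.0]  # e.g. a temperature series
print(range_terminal(temps, 1, 4))  # 9.0 (sum over the window)
print(range_terminal(temps, 0, 5, agg=lambda w: sum(w) / len(w)))  # 3.0 (mean)
```

Symbolic regression would then evolve expressions in which such window aggregates appear as terminals alongside raw variables.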
Gli stili APA, Harvard, Vancouver, ISO e altri
19

Bancarz, Iain. "Conditional-entropy metrics for feature selection". Thesis, University of Edinburgh, 2005. http://hdl.handle.net/1842/799.

Testo completo
Abstract (sommario):
We examine the task of feature selection, which is a method of forming simplified descriptions of complex data for use in probabilistic classifiers. Feature selection typically requires a numerical measure or metric of the desirability of a given set of features. The thesis considers a number of existing metrics, with particular attention to those based on entropy and other quantities derived from information theory. A useful new perspective on feature selection is provided by the concepts of partitioning and encoding of data by a feature set. The ideas of partitioning and encoding, together with the theoretical shortcomings of existing metrics, motivate a new class of feature selection metrics based on conditional entropy. The simplest of the new metrics is referred to as expected partition entropy or EPE. Performances of the new and existing metrics are compared by experiments with a simplified form of part-of-speech tagging and with classification of Reuters news stories by topic. In order to conduct the experiments, a new class of accelerated feature selection search algorithms is introduced; a member of this class is found to provide significantly increased speed with minimal loss in performance, as measured by feature selection metrics and accuracy on test data. The comparative performance of existing metrics is also analysed, giving rise to a new general conjecture regarding the wrapper class of metrics. Each wrapper is inherently tied to a specific type of classifier. The experimental results support the idea that a wrapper selects feature sets which perform well in conjunction with its own particular classifier, but this good performance cannot be expected to carry over to other types of model. 
The new metrics introduced in this thesis prove to have substantial advantages over a representative selection of other feature selection mechanisms: mutual information, frequency-based cutoff, the Koller-Sahami information loss measure, and two different types of wrapper method. Feature selection using the new metrics easily outperforms other filter-based methods such as mutual information; additionally, our approach attains comparable performance to a wrapper method, but at a fraction of the computational expense. Finally, members of the new class of metrics succeed in a case where the Koller-Sahami metric fails to provide a meaningful criterion for feature selection.
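The partition-and-encoding view described above admits a compact illustration: a feature set partitions the training data into cells, and an empirical conditional entropy of the class given the cell measures how pure the partition is. The plug-in estimator below is a generic sketch of this idea, not the thesis's EPE metric itself.

```python
from collections import Counter
from math import log2

def conditional_entropy(cells, labels):
    """Empirical H(class | cell) for data partitioned into cells by a feature set."""
    n = len(labels)
    h = 0.0
    for cell, cell_count in Counter(cells).items():
        members = [y for c, y in zip(cells, labels) if c == cell]
        for y, count in Counter(members).items():
            h -= (count / n) * log2(count / cell_count)  # -P(cell, y) log P(y | cell)
    return h

# A partition whose cells are class-pure has zero conditional entropy:
print(conditional_entropy(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 0.0
# A single-cell (uninformative) partition leaves the full class entropy:
print(conditional_entropy(["a", "a", "a", "a"], [0, 0, 1, 1]))  # 1.0
```

A feature selection metric in this family prefers feature sets whose induced partitions drive this quantity down without fragmenting the data into tiny cells.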
Gli stili APA, Harvard, Vancouver, ISO e altri
20

Longstaff, James Robert. "Feature selection for affective product development". Thesis, University of Leeds, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.511133.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
21

Choakjarernwanit, Naruetep. "Feature selection in statistical pattern recognition". Thesis, University of Surrey, 1992. http://epubs.surrey.ac.uk/843569/.

Testo completo
Abstract (sommario):
This thesis addresses the problem of feature selection in pattern recognition. A detailed analysis and an experimental comparison of various search strategies for selecting a feature set of size d from D available measurements are presented. For a realistic problem, optimal search, even if performed using the branch and bound search method, is computationally prohibitive. The alternative is to use suboptimal search methods. Of these, there are four methods, namely the sequential forward selection (SFS), sequential backward selection (SBS), sequential forward floating selection (SFFS), and sequential backward floating selection (SBFS), which are relatively simple and require little computational time. It is suggested that the SFS method should be employed in the case of limited training sample size. Although the decision about including a particular measurement in the SFS method is made on the basis of statistical dependencies among features in spaces of monotonically increasing dimensionality, the approach has proved in practice to be more reliable. This is because the algorithm utilizes at the beginning only less complex mutual relations which, using small sample sets, are determined more reliably than the statistics required by the SBS method. Because both the SFS and SBS methods suffer from the nesting effect, if a better solution is required then the SFFS and SBFS should be employed. As the first of the two main issues of the thesis, the possibility of developing feature selection techniques which rely only on the merit of individual features as well as pairs of features is investigated. This issue is considered very important because the computational advantage of such an algorithm, exploiting only at most pairwise interactions of measurements, would be very useful for solving feature selection problems of very high dimensionality. For this reason, a potentially very promising search method known as the Max-Min method is investigated.
By means of a detailed analysis of the heuristic reasoning behind the method, its weaknesses are identified. The first weakness is due to the use of an upper limit on the error bound as a measure of effectiveness of a candidate feature. This strategy does not guarantee that selecting a candidate feature with the highest upper bound will yield the highest actual amount of additional information. The second weakness is that the method does not distinguish between a strong unconditional dependence and a poor performance of a feature, both of which manifest themselves by near-zero additional discriminatory information. Modifications aimed at overcoming the latter by favouring features which exhibit conditional dependence and, on the other hand, suppressing features which exhibit strong unconditional dependence have been proposed and tested, but with only limited success. For this reason the Max-Min method is subjected to a detailed theoretical analysis. It is found that the key assumption underlying the whole Max-Min algorithm is not justified and the algorithm itself is ill-founded, i.e. the actual increment of the criterion value (or decrease of the probability of error) can be bigger than the minimum of pairwise error probability reductions assumed by the Max-Min method. A necessary condition for invalidity of the key assumption of the Max-Min algorithm is derived, and a counter-example proving the lack of justification for the algorithm is presented. The second main issue of the thesis is the development of a new feature selection method for non-normal class conditional densities. For a given dimensionality, the subset of selected features minimizes the Kullback-Leibler distance between the true and postulated class conditional densities. The algorithm is based on approximating unknown class conditional densities by a finite mixture of densities of a special type using the maximum likelihood approach.
After the optimization ends, the optimal feature subset of required dimensionality is obtained immediately without the necessity to employ any search procedure. Successful experiments with both simulated and real data are also carried out to validate the proposed method.
Gli stili APA, Harvard, Vancouver, ISO e altri
22

Song, Jingping. "Feature selection for intrusion detection system". Thesis, Aberystwyth University, 2016. http://hdl.handle.net/2160/3143de58-208f-405e-ab18-abcecfc8f33b.

Testo completo
Abstract (sommario):
Intrusion detection is an important task for network operators in today's Internet. Traditional network intrusion detection systems rely on either specialized signatures of previously seen attacks, or on labeled traffic datasets that are expensive and difficult to reproduce, for user-profiling to hunt out network attacks. Machine learning methods can be used in this area since they can acquire knowledge from signatures or from normal-operation profiles. However, there is usually a large volume of data in intrusion detection systems, for both features and instances. Feature selection can be used to optimize the classifiers used to identify attacks by removing redundant or irrelevant features while improving classification quality. In this thesis, six feature selection algorithms are developed, and their application to intrusion detection is evaluated. They are: Cascading Fuzzy C Means Clustering and C4.5 Decision Tree Classification Algorithm, New Evidence Accumulation Ensemble with Hierarchical Clustering Algorithm, Modified Mutual Information-based Feature Selection Algorithm, Mutual Information-based Feature Grouping Algorithm, Feature Grouping by Agglomerative Hierarchical Clustering Algorithm, and Online Streaming Feature Selection Algorithm. All algorithms are evaluated on the KDD 99 dataset, the most widely used data set for the evaluation of anomaly detection methods, and are compared with other algorithms. The potential application of these algorithms beyond intrusion detection is also examined and discussed.
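Several of the algorithms listed above build on mutual-information-based ranking. As a hedged baseline sketch (not any of the six algorithms themselves), one can score each discrete feature by its empirical mutual information with the class label and keep the top k; the toy feature names below are invented for illustration.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def top_k_features(columns, labels, k):
    """Rank feature columns by MI with the labels and keep the top k names."""
    ranked = sorted(columns, key=lambda f: mutual_information(columns[f], labels),
                    reverse=True)
    return ranked[:k]

labels = [0, 0, 1, 1]
columns = {
    "dup_of_label": [0, 0, 1, 1],  # perfectly predictive of the label
    "noise":        [0, 1, 0, 1],  # independent of the label
}
print(top_k_features(columns, labels, 1))  # ['dup_of_label']
```

The thesis's grouping and streaming variants refine this kind of scoring to handle redundant features and features arriving one at a time.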
Gli stili APA, Harvard, Vancouver, ISO e altri
23

Ditzler, Gregory, J. Calvin Morrison, Yemin Lan e Gail L. Rosen. "Fizzy: feature subset selection for metagenomics". BioMed Central, 2015. http://hdl.handle.net/10150/610268.

Testo completo
Abstract (sommario):
BACKGROUND: Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- & β-diversity. Feature subset selection - a sub-field of machine learning - can also provide a unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we have used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome. RESULTS: We have developed a new Python command line tool, which is compatible with the widely adopted BIOM format, for microbial ecologists that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets. CONCLUSIONS: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.
Gli stili APA, Harvard, Vancouver, ISO e altri
24

Garg, Vikas Ph D. (Vikas Kamur) Massachusetts Institute of Technology. "CRAFT : ClusteR-specific Assorted Feature selecTion". Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/105697.

Testo completo
Abstract (sommario):
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 45-46).
In this thesis, we present a hierarchical Bayesian framework for clustering with cluster-specific feature selection. We derive a simplified model, CRAFT, by analyzing the asymptotic behavior of the log posterior formulations in a nonparametric MAP-based clustering setting in this framework. The model handles assorted data, i.e., both numeric and categorical data, and the underlying objective functions are intuitively appealing. The resulting algorithm is simple to implement and scales nicely, requires minimal parameter tuning, obviates the need to specify the number of clusters a priori, and compares favorably with other state-of-the-art methods on several datasets. We provide empirical evidence on carefully designed synthetic data sets to highlight the robustness of the algorithm to recover the underlying feature subspaces, even when the average dimensionality of the features across clusters is misspecified. Besides, the framework seamlessly allows for multiple views of clustering by interpolating between the two extremes of cluster-specific feature selection and global selection, and recovers the DP-means objective [14] under the degenerate setting of clustering without feature selection.
by Vikas Garg.
S.M.
Gli stili APA, Harvard, Vancouver, ISO e altri
25

Zhang, Zhihong. "Feature selection from higher order correlations". Thesis, University of York, 2012. http://etheses.whiterose.ac.uk/3340/.

Testo completo
Abstract (sommario):
This thesis addresses the problems in feature selection, particularly focusing on selecting features from higher order correlations. To this end, we present two supervised feature selection approaches, named Graph-based Information-theoretic Feature Selection and Hypergraph-based Information-theoretic Feature Selection respectively, which are capable of considering third or even higher order dependencies between the relevant features and capturing the optimal size of the relevant feature subset. Furthermore, we develop two unsupervised feature selection methods which can evaluate features jointly rather than individually. In this case, larger feature combinations are considered. The reason for this is that although an individual feature may have limited relevance to a particular class, when taken in combination with other features it may be strongly relevant to the class. In Chapter 2, we thoroughly review the relevant literature on classifier-independent (filter-based) feature selection methods. One dominant direction of research in this area is exemplified by the so-called information-theoretic feature selection criteria, which measure the mutual dependence of two variables. Another influential direction is the graph-based feature selection methods, which select the features that best preserve the data similarity or a manifold structure derived from the entire feature set. We notice that most existing feature selection methods evaluate features individually or simply consider pairwise feature interactions, and hence cannot handle redundant features. Another shortcoming of existing feature selection methods is that most of them select features in a greedy way and do not provide a direct measure to judge whether to add additional features or not. To deal with this problem, they require a user to supply the number of selected features in advance.
However, in real applications, it is hard to estimate the number of useful features before the feature selection process. This thesis addresses these weaknesses, and fills a gap in the literature of selecting features from higher order correlations. In Chapter 3 we propose a graph-based information-theoretic approach to feature selection. There are three novel ingredients. First, by incorporating mutual information (MI) as a pairwise feature similarity measure, we establish a novel feature graph framework which is used for characterizing the informativeness between pairs of features. Secondly, we locate the relevant feature subset (RFS) from the feature graph by maximizing features' average pairwise relevance. The RFS is expected to have little redundancy and very strong discriminating power. This strategy reduces the optimal search space from the original feature set to the relatively smaller relevant feature subset, and thus enables efficient computation. Finally, based on the RFS, we evaluate the importance of unselected features by using a new information-theoretic criterion referred to as the multidimensional interaction information (MII). The advantage of MII is that it can go beyond pairwise interaction and consider third or higher order feature interactions. As a result, we can evaluate features jointly, and thus avoid the redundancies arising in individual feature combinations. Experimental results demonstrate the effectiveness of our feature selection method on a number of standard data-sets. In Chapter 4, we find that in some situations the graph representation for relational patterns can lead to substantial loss of information. This is because in real-world problems objects and their features tend to exhibit multiple relationships rather than simple pairwise ones. This motivates us to establish a feature hypergraph (rather than a feature graph) to characterize the multiple relationships among features.
We draw on recent work on hypergraph clustering to select the most informative feature subset (mIFS) from a set of objects using high-order (rather than pairwise) similarities. There are two novel ingredients. First, we use MII to measure the significance of different feature combinations with respect to the class labels. Secondly, we use hypergraph clustering to select the most informative feature subset (mIFS), which has both low redundancy and strong discriminating power. The advantage of MII is that it incorporates third or higher order feature interactions, while hypergraph clustering extracts the most informative features. The size of the most informative feature subset (mIFS) is determined automatically. Experimental results demonstrate the effectiveness of our feature selection method on a number of standard data-sets. In addition to the supervised feature selection methods, we present two novel unsupervised feature selection methods in Chapter 5 and Chapter 6. Specifically, we propose a new two-step spectral regression technique for unsupervised feature selection in Chapter 5. In the first step, we use kernel entropy component analysis (kECA) to transform the data into a lower-dimensional space so as to improve class separation. Second, we use ℓ1-norm regularization to select the features that best align with the data embedding resulting from kECA. The advantage of kECA is that the dimensionality-reducing data transformation maximally preserves entropy estimates for the input data whilst also best preserving the cluster structure of the data. Using ℓ1-norm regularization, we cast feature discriminant analysis into a regression framework which accommodates the correlations among features. As a result, we can evaluate joint feature combinations, rather than being confined to consider them individually. Experimental results demonstrate the effectiveness of our feature selection method on a number of standard face data-sets.
In Chapter 6, by incorporating MII as a higher order similarity measure, we establish a novel hypergraph framework which is used for characterizing the multiple relationships within a set of samples (e.g. face samples under varying illumination conditions). Thus, the structural information latent in the data can be more effectively modeled. We then explore a strategy to select the discriminating feature subset on the basis of the hypergraph representation. The strategy is based on an unsupervised method which derives a hypergraph embedding view of feature selection. We evaluate the strategy on a number of standard image datasets, and the results demonstrate the effectiveness of our feature selection method. We summarize the contributions of this thesis in Chapter 7, and analyze the developed methods. Finally, we give some suggestions for future work in feature selection.
Gli stili APA, Harvard, Vancouver, ISO e altri
26

Bonev, Boyan. "Feature selection based on information theory". Doctoral thesis, Universidad de Alicante, 2010. http://hdl.handle.net/10045/18362.

Testo completo
Abstract (sommario):
Along with the improvement of data acquisition techniques and the increasing computational capacity of computers, the dimensionality of the data grows higher. Pattern recognition methods have to deal with samples consisting of thousands of features, and the reduction of their dimensionality becomes crucial to make them tractable. Feature selection is a technique for removing the irrelevant and noisy features and selecting a subset of features which better describe the samples and produce better classification performance. It is becoming an essential part of most pattern recognition applications.
In this thesis we propose a feature selection method for supervised classification. The main contribution is the efficient use of information theory, which provides a solid theoretical framework for measuring the relation between the classes and the features. Mutual information is considered to be the best measure for such a purpose. Traditionally it has been measured for ranking single features without taking into account the entire set of selected features. This is due to the computational complexity involved in estimating the mutual information. However, in most data sets the features are not independent, and their combination provides much more information about the class than the sum of their individual prediction power.
Methods based on density estimation can only be used for data sets with a very high number of samples and low number of features. Due to the curse of dimensionality, in a multi-dimensional feature space the amount of samples required for a reliable density estimation is very high. For this reason we analyse the use of different estimation methods which bypass the density estimation and estimate entropy directly from the set of samples. These methods allow us to efficiently evaluate sets of thousands of features.
For high-dimensional feature sets another problem is the search order of the feature space. All algorithms of non-prohibitive computational cost search for a sub-optimal feature set. Greedy algorithms are the fastest and are the ones which incur the least overfitting. We show that from the information-theoretical perspective, a greedy backward selection algorithm conserves the amount of mutual information, even though the feature set is not the minimal one.
We also validate our method in several real-world applications. We apply feature selection to omnidirectional image classification through a novel approach. It is appearance-based and we select features from a bank of filters applied to different parts of the image. The context of the task is place recognition for mobile robotics. Another set of experiments is performed on microarrays from gene expression databases. The classification problem aims to predict the disease of a new patient. We present a comparison of the classification performance, and the algorithms we present are shown to outperform the existing ones. Finally, we successfully apply feature selection to spectral graph classification. All the features we use are for unattributed graphs, which constitutes a contribution to the field. We also draw interesting conclusions about which spectral features matter most under different experimental conditions. In the context of graph classification we also show how important the precise estimation of mutual information is, and we analyse its impact on the final classification results.
Gli stili APA, Harvard, Vancouver, ISO e altri
27

Nguyen, Minh Phu <1988>. "Feature Selection using Dominant-Set Clustering". Master's Degree Thesis, Università Ca' Foscari Venezia, 2016. http://hdl.handle.net/10579/8058.

Testo completo
Abstract (sommario):
Feature selection techniques are essential in data analysis tasks, where one frequently deals with many features. It is computationally expensive to optimize over features that are redundant or irrelevant. Many methods approach this problem; however, each has its own limitations. In this thesis, a method based on dominant-set clustering and multidimensional interaction information is considered.
Gli stili APA, Harvard, Vancouver, ISO e altri
28

Scalco, Alberto <1993>. "Feature Selection Using Neural Network Pruning". Master's Degree Thesis, Università Ca' Foscari Venezia, 2019. http://hdl.handle.net/10579/14382.

Testo completo
Abstract (sommario):
Feature selection is a well-known technique for data preprocessing with the purpose of removing redundant and irrelevant information, with benefits including improved generalization and a reduced curse of dimensionality. This paper investigates an approach based on a trained neural network model, where features are selected by iteratively removing a node in the input layer. This pruning process comprises a node selection criterion and a subsequent weight correction: after a node elimination, the remaining weights are adjusted so that the overall network behaviour does not worsen over the entire training set. The pruning problem is formulated as a system of linear equations solved in a least-squares sense. This method allows the direct evaluation of the performance at each iteration, and a stopping condition is also proposed. Finally, experimental results are presented in comparison to another feature selection method.
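A minimal sketch of the least-squares weight correction the abstract describes, under my own simplifying assumptions (a single linear unit rather than a full network): after deleting an input node, the remaining weights are re-fit so the unit's outputs over the training set change as little as possible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))         # training inputs with 3 features
W = np.array([[1.0], [2.0], [0.0]])  # weights of a single linear unit
Y = X @ W                            # outputs to preserve after pruning

drop = 2                             # prune the third input (zero weight)
X_kept = np.delete(X, drop, axis=1)
# Behaviour-preserving correction: solve min_W' ||X_kept W' - Y||^2.
W_new, *_ = np.linalg.lstsq(X_kept, Y, rcond=None)

residual = float(np.abs(X_kept @ W_new - Y).max())
print(residual < 1e-8)  # True: an input with zero weight prunes exactly
```

In general the residual of this least-squares system is nonzero, and its size is a natural node selection criterion and stopping signal, in the spirit of the method the abstract outlines.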
Gli stili APA, Harvard, Vancouver, ISO e altri
29

Ng, Andrew Y. 1976. "On feature selection : learning with exponentially many irrelevant features as training examples". Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/9658.

Testo completo
Abstract (sommario):
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.
Includes bibliographical references (p. 55-57).
We consider feature selection for supervised machine learning in the "wrapper" model of feature selection. This typically involves an NP-hard optimization problem that is approximated by heuristic search for a "good" feature subset. First considering the idealization where this optimization is performed exactly, we give a rigorous bound for generalization error under feature selection. The search heuristics typically used are then immediately seen as trying to achieve the error given in our bounds, and succeeding to the extent that they succeed in solving the optimization. The bound suggests that, in the presence of many "irrelevant" features, the main source of error in wrapper model feature selection is from "overfitting" hold-out or cross-validation data. This motivates a new algorithm that, again under the idealization of performing search exactly, has sample complexity (and error) that grows logarithmically in the number of "irrelevant" features - which means it can tolerate having a number of "irrelevant" features exponential in the number of training examples - and search heuristics are again seen to be directly trying to reach this bound. Experimental results on a problem using simulated data show the new algorithm having much higher tolerance to irrelevant features than the standard wrapper model. Lastly, we also discuss ramifications that sample complexity logarithmic in the number of irrelevant features might have for feature design in actual applications of learning.
by Andrew Y. Ng.
S.M.
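The "wrapper" model analysed in this thesis evaluates candidate feature subsets by scoring a learner on held-out data. A minimal sketch of such a wrapper search follows: a greedy forward search around a simple nearest-centroid classifier, chosen here purely for brevity (the thesis is about the model's sample complexity, not any particular learner or search heuristic).

```python
import numpy as np

def holdout_error(Xtr, ytr, Xte, yte, feats):
    """Hold-out error of a nearest-centroid classifier restricted to feats."""
    feats = list(feats)
    cents = {c: Xtr[ytr == c][:, feats].mean(axis=0) for c in np.unique(ytr)}
    preds = np.array([min(cents, key=lambda c: np.linalg.norm(x[feats] - cents[c]))
                      for x in Xte])
    return float(np.mean(preds != yte))

def forward_wrapper(Xtr, ytr, Xte, yte):
    """Greedy forward search in the wrapper model: repeatedly add the single
    feature that most reduces hold-out error; stop when no addition helps."""
    selected, best_err = [], np.inf
    remaining = set(range(Xtr.shape[1]))
    while remaining:
        errs = {f: holdout_error(Xtr, ytr, Xte, yte, selected + [f])
                for f in remaining}
        f, err = min(errs.items(), key=lambda kv: kv[1])
        if err >= best_err:
            break
        selected.append(f)
        remaining.discard(f)
        best_err = err
    return selected, best_err
```

Every candidate subset is scored on the same hold-out set, which is exactly the source of "overfitting hold-out data" that the abstract's bound identifies when many irrelevant features are present.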
30

Tan, Feng. "Improving Feature Selection Techniques for Machine Learning". Digital Archive @ GSU, 2007. http://digitalarchive.gsu.edu/cs_diss/27.

Full text
Abstract:
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noise features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We propose a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features (a wrapper method); we call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and find small subsets of features that perform well for the target algorithm. The experiments on genomic data demonstrate that ours is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We propose a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and distribute most differently among all classes. The approach is compared with two well-known feature selection methods in the experiments on two standard text corpora. The experiments show that RSFV generates equal or better performance than the others in many cases.
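A GA-driven wrapper of the kind the HGFS framework builds on searches over binary feature masks, scoring each mask with the target learner. The toy GA below is only a hedged sketch of that idea: the caller-supplied `fitness` stands in for the wrapper evaluation (e.g. cross-validated accuracy minus a size penalty), and the population size, operators and rates are illustrative, not those of HGFS.

```python
import numpy as np

def ga_feature_select(fitness, n_features, pop=24, gens=40, seed=0):
    """Tiny genetic algorithm over binary feature masks, in the spirit of
    GA/wrapper hybrids: fitness(mask) should reward accuracy of the target
    learner and penalise subset size. Returns the best mask found."""
    rng = np.random.default_rng(seed)
    P = rng.integers(0, 2, size=(pop, n_features))
    for _ in range(gens):
        scores = np.array([fitness(m) for m in P])
        P = P[np.argsort(scores)[::-1]]            # best first
        elite = P[: pop // 2]                      # truncation selection
        children = []
        for _ in range(pop - len(elite)):
            a, b = elite[rng.integers(len(elite), size=2)]
            cut = rng.integers(1, n_features)      # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < 1.0 / n_features  # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        P = np.vstack([elite, children])
    scores = np.array([fitness(m) for m in P])
    return P[np.argmax(scores)]
```

Because half the population is carried over unchanged (elitism), the best fitness seen never decreases from one generation to the next.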
31

Butko, Taras. "Feature selection for multimodal acoustic event detection". Doctoral thesis, Universitat Politècnica de Catalunya, 2011. http://hdl.handle.net/10803/32176.

Full text
Abstract:
The detection of the Acoustic Events (AEs) naturally produced in a meeting room may help to describe human and social activity. The automatic description of interactions between humans and the environment can be useful for providing: implicit assistance to the people inside the room, context-aware and content-aware information requiring a minimum of human attention or interruptions, support for high-level analysis of the underlying acoustic scene, etc. On the other hand, the recent fast growth of available audio and audiovisual content strongly demands tools for analyzing, indexing, searching and retrieving the available documents. Given an audio document, the first processing step usually is audio segmentation (AS), i.e. the partitioning of the input audio stream into acoustically homogeneous regions which are labelled according to a predefined broad set of classes like speech, music, noise, etc. Acoustic event detection (AED) is the objective of this thesis work. A variety of features coming not only from audio but also from the video modality is proposed to deal with that detection problem in meeting-room and broadcast news domains. Two basic detection approaches are investigated in this work: a joint segmentation and classification using Hidden Markov Models (HMMs) with Gaussian Mixture Densities (GMMs), and a detection-by-classification approach using discriminative Support Vector Machines (SVMs). For the first case, a fast one-pass-training feature selection algorithm is developed in this thesis to select, for each AE class, the subset of multimodal features that shows the best detection rate. AED in meeting-room environments aims at processing the signals collected by distant microphones and video cameras in order to obtain the temporal sequence of (possibly overlapped) AEs that have been produced in the room.
When applied to interactive seminars with a certain degree of spontaneity, the detection of acoustic events from the audio modality alone shows a large number of errors, mostly due to the temporal overlaps of sounds. This thesis includes several novelties regarding the task of multimodal AED. Firstly, the use of video features. Since in the video modality the acoustic sources do not overlap (except for occlusions), the proposed features improve AED in such rather spontaneous scenario recordings. Secondly, the inclusion of acoustic localization features, which, in combination with the usual spectro-temporal audio features, yield a further improvement in recognition rate. Thirdly, the comparison of feature-level and decision-level fusion strategies for the combination of audio and video modalities. In the latter case, the system output scores are combined using two statistical approaches: weighted arithmetic mean and fuzzy integral. On the other hand, due to the scarcity of annotated multimodal data, and, in particular, of data with temporal sound overlaps, a new multimodal database with a rich variety of meeting-room AEs has been recorded and manually annotated, and it has been made publicly available for research purposes.
Activity detection and description is a key functionality of perceptually-aware interfaces working in human communication environments such as meeting rooms. In fact, AS can be seen as a particular case of acoustic event detection, and it is treated as such in this thesis. For audio segmentation in the broadcast news domain, a hierarchical system architecture is proposed which appropriately groups a set of detectors, each corresponding to one of the acoustic classes of interest. Two different AS systems were developed for two broadcast news databases: the first corresponds to audio recordings of the debate programme Àgora on the Catalan TV channel TV3, and the second includes several audio segments from the Catalan broadcast news TV channel 3/24. The output of the first system was used as the first stage of the automatic translation and subtitling systems of the Tecnoparla project, a project funded by the Generalitat government in which several speech technologies were developed to extract all possible information from the audio signal. The second AS system, a hierarchical HMM-GMM-based detection system with feature selection, obtained competitive results in the Albayzín-2010 audio segmentation evaluation. Finally, some collateral outcomes of this thesis are worth mentioning. The author was responsible for organizing the audio segmentation evaluation within the above-mentioned Albayzín-2010 campaign: the event classes, databases, metric and evaluation protocols were specified, and a subsequent analysis was carried out of the systems and results submitted by the eight participating research groups from Spanish and Portuguese universities. In addition, a real-time HMM-GMM-based acoustic event detection system for two simultaneous sources was implemented in the UPC multimodal room for testing and demonstration purposes.
32

Lin, Pengpeng. "A Framework for Consistency Based Feature Selection". TopSCHOLAR®, 2009. http://digitalcommons.wku.edu/theses/62.

Full text
Abstract:
Feature selection is an effective technique for reducing the dimensionality of features in many applications where datasets involve hundreds or thousands of features. The objective of feature selection is to find an optimal subset of relevant features such that the feature size is reduced and the understandability of a learning process is improved without significantly decreasing the overall accuracy and applicability. This thesis focuses on the consistency measure, under which a feature subset is inconsistent if there exist two or more instances with the same feature values but different class labels. This thesis introduces a new consistency-based algorithm, Automatic Hybrid Search (AHS), and reviews several existing feature selection algorithms (ES, PS and HS) which are based on the consistency rate. We conclude this work with an empirical study providing a comparative analysis of the different search algorithms.
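The consistency criterion this thesis builds on is usually computed as an inconsistency rate: instances that agree on the selected features but disagree on the class count as inconsistent. A minimal sketch follows (the exact measure used by AHS may differ in detail):

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    """Inconsistency rate of a feature subset: within each group of
    instances sharing the same values on the subset, every instance not
    carrying the group's majority class counts as inconsistent."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        groups[tuple(row[i] for i in subset)][y] += 1
    incons = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return incons / len(rows)
```

A subset with rate 0 perfectly separates the classes on the training data; consistency-based search looks for the smallest subset whose rate stays below a threshold.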
33

Teixeira, de Souza Jerffeson. "Feature selection with a general hybrid algorithm". Thesis, University of Ottawa (Canada), 2004. http://hdl.handle.net/10393/29177.

Full text
Abstract:
The Feature Selection problem involves discovering a subset of features, such that a classifier built only with this subset would have better predictive accuracy than a classifier built from the entire set of features. A large number of algorithms have already been proposed for the feature selection problem. Although significantly different with regards to (1) the search strategy they use to determine the right subset of features and (2) how each subset is evaluated, feature selection algorithms are usually classified in three general groups: Filters, Wrappers and Hybrid solutions. In this thesis, we propose a new hybrid system for the problem of feature selection in machine learning. The idea behind this new algorithm, FortalFS, is to extract and combine the best characteristics of filters and wrappers in one algorithm. FortalFS uses results from another feature selection system as a starting point in the search through subsets of features that are evaluated by a machine learning algorithm. With an efficient search heuristic, we can decrease the number of subsets of features to be evaluated by the learning algorithm, consequently decreasing computational effort, and still be able to select an accurate subset. We have also designed a variant of the original algorithm in an attempt to work with feature weighting algorithms. In order to evaluate this new algorithm, a number of experiments were run and the results compared to well-known feature selection filter and wrapper algorithms, such as Focus, Relief, LVF, and others. Such experiments were run over a number of datasets from the UCI Repository. Results showed that FortalFS outperforms most of the algorithms significantly. However, it presents time-consuming performance similar to that of wrappers. Additional experiments using specially designed artificial datasets demonstrated that FortalFS is able to identify and remove irrelevant, redundant and randomly class-correlated features.
The FortalFS time-consumption issue is addressed through parallelism. A parallel version of FortalFS based on the master/slave design pattern is implemented and evaluated. In several experiments, we were able to achieve near optimal speedups.
34

Vanhoy, Garrett, and Noel Teku. "FEATURE SELECTION FOR CYCLOSTATIONARY-BASED SIGNAL CLASSIFICATION". International Foundation for Telemetering, 2017. http://hdl.handle.net/10150/626974.

Full text
Abstract:
Cognitive radio (CR) is a concept that imagines a radio (wireless transceiver) that contains an embedded intelligent agent that can adapt to its spectral environment. Using a software defined radio (SDR), a radio can detect the presence of other users in the spectrum and adapt accordingly, but it is important in many applications to discern between individual transmitters, and this can be done using signal classification. The use of cyclostationary features has been shown to be robust to many common channel conditions. One such cyclostationary feature, the spectral correlation density (SCD), has seen limited use in signal classification until now because it is a computationally intensive process. This work demonstrates how feature selection techniques can be used to enable real-time classification. The proposed technique is validated using 8 common modulation formats that are generated and collected over the air.
35

Mashhadi-Farahani, Bahman. "Feature extraction and selection for speech recognition". Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/nq38255.pdf.

Full text
36

Kumar, Rajeev. "Feature selection, representation and classification in vision". Thesis, University of Sheffield, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.245688.

Full text
37

Huang, Yu'e. "An optimization of feature selection for classification". Thesis, University of Ulster, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.428284.

Full text
38

MOTTA, EDUARDO NEVES. "SUPERVISED LEARNING INCREMENTAL FEATURE INDUCTION AND SELECTION". PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2014. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=28688@1.

Full text
Abstract:
PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
CONSELHO NACIONAL DE DESENVOLVIMENTO CIENTÍFICO E TECNOLÓGICO
PROGRAMA DE EXCELENCIA ACADEMICA
Non-linear feature induction from basic features is a method of generating predictive models with higher precision for classification problems. However, feature induction may rapidly lead to a huge number of features, causing overfitting and models with low predictive power. To prevent this side effect, regularization techniques are employed to obtain a trade-off between a reduced feature set representative of the domain and generalization power. In this work, we describe a supervised machine learning approach that incrementally induces and selects feature conjunctions derived from base features. This approach integrates decision trees, support vector machines and feature selection using sparse perceptrons in a machine learning framework named IFIS – Incremental Feature Induction and Selection. Using IFIS, we generate regularized non-linear models with high performance using a linear algorithm. We evaluate our system on two natural language processing tasks in two different languages. For the first task, POS tagging, we use two corpora, the WSJ corpus for English and Mac-Morpho for Portuguese. Our results are competitive with the state-of-the-art performance in both, achieving accuracies of 97.14 per cent and 97.13 per cent, respectively. In the second task, Dependency Parsing, we use the CoNLL 2006 Shared Task Portuguese corpus, achieving better results than those reported during that competition and competitive with the state-of-the-art for this task, with a UAS score of 92.01 per cent. Applying model regularization using a sparse perceptron, we obtain SVM models 10 times smaller, while maintaining their accuracies. We achieve model reduction by regularization of feature domains, which can reach 99 per cent. Using the regularized model, we achieve a reduction of up to 82 per cent in the physical size of the models. The prediction time is cut by up to 84 per cent.
Downsizing domains and models also enhances feature engineering, through compact domain analysis and the incremental inclusion of new features.
39

Thapa, Mandira. "Optimal Feature Selection for Spatial Histogram Classifiers". Wright State University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=wright1513710294627304.

Full text
40

Zhao, Helen. "Interactive Causal Feature Selection with Prior Knowledge". Case Western Reserve University School of Graduate Studies / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=case1553785900876815.

Full text
41

Loscalzo, Steven. "Group based techniques for stable feature selection". Diss., Online access via UMI, 2009.

Search full text
42

Söderberg, Max Joel, and Axel Meurling. "Feature selection in short-term load forecasting". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-259692.

Full text
Abstract:
This paper investigates correlation between energy consumption 24 hours ahead and features used for predicting energy consumption. The features originate from three categories: weather, time and previous energy. The correlations are calculated using Pearson correlation and mutual information. This resulted in the highest correlated features being those representing previous energy consumption, followed by temperature and month. Two identical feature sets containing all attributes (in this report, the words "attribute" and "feature" are used interchangeably) were obtained by ranking the features according to correlation. Three feature sets were created manually. The first set contained seven attributes representing previous energy consumption over the course of the seven days prior to the day of prediction. The second set consisted of weather and time attributes. The third set consisted of all attributes from the first and second set. These sets were then compared on different machine learning models. It was found that the set containing all attributes and the set containing previous energy attributes yielded the best performance for each machine learning model.
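The two correlation measures used here, Pearson correlation and mutual information, can be sketched for feature ranking as follows. The histogram-based MI estimator is one common, simple choice and is not necessarily the one used in the thesis.

```python
import numpy as np

def pearson_scores(X, y):
    """Absolute Pearson correlation of each feature column with the target."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return np.abs(num / den)

def mutual_information(x, y, bins=8):
    """Mutual information between a feature and the target, estimated by
    histogram discretisation of the joint distribution."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())
```

Ranking features by either score and keeping the top-k is exactly the procedure used to build the ranked feature sets described in the abstract; Pearson captures only linear dependence, while MI also picks up non-linear relationships.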
43

Pighin, Daniele. "Greedy Feature Selection in Tree Kernel Spaces". Doctoral thesis, Università degli studi di Trento, 2010. https://hdl.handle.net/11572/368779.

Full text
Abstract:
Tree Kernel functions are powerful tools for solving different classes of problems requiring large amounts of structured information. Combined with accurate learning algorithms, such as Support Vector Machines, they allow us to directly encode rich syntactic data in our learning problems without requiring an explicit feature mapping function or deep specific domain knowledge. However, like other very high-dimensional kernel families, they come with two major drawbacks: first, the computational complexity induced by the dual representation makes them impractical for very large datasets or for situations where very fast classifiers are necessary, e.g. real time systems or web applications; second, their implicit nature somehow limits their scientific appeal, as the implicit models that we learn cannot cast new light on the studied problems. As a possible solution to these two problems, this Thesis presents an approach to feature selection for tree kernel functions in the context of Support Vector learning, based on a greedy exploration of the fragment space. Features are selected according to a gradient norm preservation criterion, i.e. we select the heaviest features that account for a large percentage of the gradient norm, and are explicitly modeled and represented. The result of the feature extraction process is a data structure that can be used to decode the input structured data, i.e. to explicitly describe a tree in terms of its more relevant fragments. We present theoretical insights that justify the adopted strategy and detail the algorithms and data structures used to explore the feature space and store the most relevant features.
Experiments on three different multi-class NLP tasks and data sets, namely question classification, relation extraction and semantic role labeling, confirm the theoretical findings and show that the decoding process can produce very fast and accurate linear classifiers, along with the explicit representation of the most relevant structured features identified for each class.
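The gradient norm preservation criterion described above (keep the heaviest features accounting for a large percentage of the gradient norm) can be illustrated on an explicit weight vector. In the thesis the selection runs greedily over the implicit fragment space, so this is only a schematic sketch of the selection rule itself.

```python
import numpy as np

def heaviest_fraction(weights, tau=0.9):
    """Select the smallest set of features whose squared weights account
    for a fraction tau of the total squared norm -- a 'gradient norm
    preservation' style criterion on an explicit weight vector."""
    w2 = np.asarray(weights, dtype=float) ** 2
    order = np.argsort(w2)[::-1]            # heaviest features first
    cum = np.cumsum(w2[order])
    k = int(np.searchsorted(cum, tau * w2.sum()) + 1)
    return sorted(order[:k].tolist())
```

Keeping only these features yields a small explicit model whose gradient norm, and hence margin behaviour, stays close to that of the full kernel model.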
44

Pighin, Daniele. "Greedy Feature Selection in Tree Kernel Spaces". Doctoral thesis, University of Trento, 2010. http://eprints-phd.biblio.unitn.it/359/1/thesis.pdf.

Full text
Abstract:
Tree Kernel functions are powerful tools for solving different classes of problems requiring large amounts of structured information. Combined with accurate learning algorithms, such as Support Vector Machines, they allow us to directly encode rich syntactic data in our learning problems without requiring an explicit feature mapping function or deep specific domain knowledge. However, like other very high-dimensional kernel families, they come with two major drawbacks: first, the computational complexity induced by the dual representation makes them impractical for very large datasets or for situations where very fast classifiers are necessary, e.g. real time systems or web applications; second, their implicit nature somehow limits their scientific appeal, as the implicit models that we learn cannot cast new light on the studied problems. As a possible solution to these two problems, this Thesis presents an approach to feature selection for tree kernel functions in the context of Support Vector learning, based on a greedy exploration of the fragment space. Features are selected according to a gradient norm preservation criterion, i.e. we select the heaviest features that account for a large percentage of the gradient norm, and are explicitly modeled and represented. The result of the feature extraction process is a data structure that can be used to decode the input structured data, i.e. to explicitly describe a tree in terms of its more relevant fragments. We present theoretical insights that justify the adopted strategy and detail the algorithms and data structures used to explore the feature space and store the most relevant features.
Experiments on three different multi-class NLP tasks and data sets, namely question classification, relation extraction and semantic role labeling, confirm the theoretical findings and show that the decoding process can produce very fast and accurate linear classifiers, along with the explicit representation of the most relevant structured features identified for each class.
45

Rezaei, Boroujeni Forough. "Feature Selection for Hybrid Data Sets and Feature Extraction for Non-Hybrid Data Sets". Thesis, Griffith University, 2021. http://hdl.handle.net/10072/404170.

Full text
Abstract:
Feature selection in terms of inductive supervised learning is a process of selecting a subset of features which are relevant to the target concept and removing irrelevant features. The minimally sized subset of features leads to a negligible degradation or even improvement in classification performance. Although, in real-world applications, data sets usually come with a mixture of both numerical and categorical variables (called hybrid data sets), little effort has been devoted to designing feature selection methods which can handle both numerical and categorical data simultaneously. An exception is Recursive Feature Elimination under the clinical kernel function, which is an embedded feature selection method. However, it suffers from low classification performance. Feature extraction algorithms transform or project the original data onto a smaller dataset which is more compact and of stronger discriminating power. Fisher's linear discriminant analysis is a widely accepted feature extraction method, which aims to find a transformation matrix to convert the feature space to a smaller space by maximising the between-class scatter matrix while minimising the within-class scatter matrix. Although the fast and easy process of finding the transformation matrix has made this method attractive, overemphasizing the large class distances makes the criterion of this method suboptimal. In this case, close class pairs tend to overlap in the subspace. Although different weighting methods have been developed to overcome this problem, there is still room to improve on this issue. In the area of feature selection, we propose several embedded feature selection methods which are capable of dealing with hybrid balanced and hybrid imbalanced data sets. In the experimental evaluation on five UCI Machine Learning Repository data sets, we demonstrate the dominance and effectiveness of the proposed methods in terms of dimensionality reduction and classification performance.
In the area of feature extraction, we propose a weighted trace ratio obtained by maximising the harmonic mean of the multiple objective reciprocals. To further improve the performance, we enforce the ℓ2,1-norm on the developed objective function. Additionally, we propose an iterative algorithm to optimise this objective function. The proposed method avoids the domination problem of the largest objective, and guarantees that no objectives will be too small. This method can be more beneficial if the number of classes is large. The extensive experiments on different datasets show the effectiveness of our proposed method when compared with four state-of-the-art methods.
Thesis (Masters)
Master of Philosophy (MPhil)
School of Info & Comm Tech
Science, Environment, Engineering and Technology
Full Text
APA, Harvard, Vancouver, ISO and other styles
46

Gupta, Chelsi. "Feature Selection and Analysis for Standard Machine Learning Classification of Audio Beehive Samples". DigitalCommons@USU, 2019. https://digitalcommons.usu.edu/etd/7564.

Full text
Abstract (summary):
Beekeepers need to inspect their hives regularly in order to protect them from various stressors. Manual inspection of hives requires a lot of time and effort. Hence, many researchers have started using electronic beehive monitoring (EBM) systems to collect critical information from beehives and alert beekeepers to possible threats. EBM collects information by placing multiple sensors in the hive; the sensors gather video, audio or temperature data. This thesis involves the automatic classification of audio samples from a beehive into bee buzzing, cricket chirping and ambient noise, using machine learning models. Classifying samples into these three categories helps beekeepers determine the health of beehives by analyzing the sound patterns in a typical audio sample. Abnormalities in the classification pattern over a period of time can notify beekeepers of potential risks to the hives, such as attack by foreign bodies (Varroa mites or wing virus), climate changes and other stressors.
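The three-way audio classification task described above can be illustrated with a minimal sketch: hypothetical band-energy spectral features and a nearest-centroid classifier on synthetic signals. The sampling rate, tone frequencies and classifier here are illustrative assumptions, not the features or models used in the thesis.

```python
import numpy as np

def band_energies(signal, n_bands=8):
    """Crude spectral feature: normalised energy per frequency band."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(spec, n_bands)
    e = np.array([b.sum() for b in bands])
    return e / e.sum()

rng = np.random.default_rng(1)
t = np.arange(8000) / 8000.0  # one second at an assumed 8 kHz

def make(kind):
    if kind == "buzz":   # low-frequency hum, ~250 Hz
        return np.sin(2 * np.pi * 250 * t) + 0.1 * rng.normal(size=t.size)
    if kind == "chirp":  # higher-pitched tone, ~3 kHz
        return np.sin(2 * np.pi * 3000 * t) + 0.1 * rng.normal(size=t.size)
    return rng.normal(size=t.size)  # ambient noise: flat spectrum

labels = ["buzz", "chirp", "noise"]
# Per-class centroid of features over a few training samples
train = {k: np.mean([band_energies(make(k)) for _ in range(5)], axis=0)
         for k in labels}

def classify(signal):
    f = band_energies(signal)
    return min(labels, key=lambda k: np.linalg.norm(f - train[k]))

print(classify(make("buzz")))  # buzz
```

Tracking how the predicted class distribution drifts over time is the kind of abnormality signal the abstract suggests alerting beekeepers with.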
APA, Harvard, Vancouver, ISO and other styles
47

May, Michael. "Data analytics and methods for improved feature selection and matching". Thesis, University of Manchester, 2012. https://www.research.manchester.ac.uk/portal/en/theses/data-analytics-and-methods-for-improved-feature-selection-and-matching(965ded10-e3a0-4ed5-8145-2af7a8b5e35d).html.

Full text
Abstract (summary):
This work focuses on analysing and improving feature detection and matching. After creating an initial framework of study, four main areas of work are researched. These areas make up the main chapters within this thesis and focus on using the Scale Invariant Feature Transform (SIFT). The preliminary analysis of the SIFT investigates how this algorithm functions. Included is an analysis of the SIFT feature descriptor space and an investigation into the noise properties of the SIFT. It introduces a novel use of the a contrario methodology and shows the success of this method as a way of discriminating between images which are likely to contain corresponding regions and images which do not. Parameter analysis of the SIFT uses both parameter sweeps and genetic algorithms as an intelligent means of setting the SIFT parameters for different image types, utilising a GPGPU implementation of SIFT. The results have demonstrated which parameters are more important when optimising the algorithm and the areas within the parameter space to focus on when tuning the values. A multi-exposure, High Dynamic Range (HDR), fusion features process has been developed where the SIFT image features are matched within high contrast scenes. Bracketed exposure images are analysed and features are extracted and combined from different images to create a set of features which describe a larger dynamic range. They are shown to reduce the effects of noise and artefacts that are introduced when extracting features from HDR images directly, and have superior image matching performance. The final area is the development of a novel, 3D-based, SIFT weighting technique which utilises the 3D data from a pair of stereo images to cluster and class matched SIFT features. Weightings are applied to the matches based on the 3D properties of the features and how they cluster, in order to discriminate between correct and incorrect matches using the a contrario methodology.
The results show that the technique provides a method for discriminating between correct and incorrect matches and that the a contrario methodology has potential for future investigation as a method for correct feature match prediction.
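The SIFT matching step this thesis analyses is conventionally based on Lowe's nearest-neighbour ratio test, which can be sketched with plain numpy on synthetic 128-dimensional descriptors. The descriptors and the ratio threshold here are illustrative assumptions; this is the standard matching baseline, not the a contrario or 3D-weighting methods themselves.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Lowe-style ratio test: accept a match only when the nearest
    neighbour is clearly closer than the second nearest."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, j1))
    return matches

rng = np.random.default_rng(2)
desc_b = rng.normal(size=(100, 128))                       # SIFT-like 128-D descriptors
desc_a = desc_b[:20] + 0.01 * rng.normal(size=(20, 128))   # noisy copies: true matches
m = ratio_test_matches(desc_a, desc_b)
print(len(m))  # all 20 true correspondences survive the test
```

The a contrario extension described above goes a step further: rather than a fixed ratio, it bounds the expected number of matches that could arise by chance.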
APA, Harvard, Vancouver, ISO and other styles
48

Lorentzon, Matilda. "Feature Extraction for Image Selection Using Machine Learning". Thesis, Linköpings universitet, Datorseende, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-142095.

Full text
Abstract (summary):
During flights with manned or unmanned aircraft, continuous recording can result in a very high number of images to analyze and evaluate. To simplify image analysis and to minimize data link usage, appropriate images should be suggested for transfer and further analysis. This thesis investigates features used for selection of images worthy of further analysis using machine learning. The selection is done based on the criteria of having good quality, salient content and being unique compared to the other selected images. The investigation is approached by implementing two binary classifications, one regarding content and one regarding quality. The classifications are made using support vector machines. For each of the classifications three feature extraction methods are performed and the results are compared against each other. The feature extraction methods used are histograms of oriented gradients, features from the discrete cosine transform domain and features extracted from a pre-trained convolutional neural network. The images classified as both good and salient are then clustered based on similarity measures retrieved using color coherence vectors. One image from each cluster is retrieved and those are the resulting images from the image selection. The performance of the selection is evaluated using the measures precision, recall and accuracy. The investigation showed that using features extracted from the discrete cosine transform provided the best results for the quality classification. For the content classification, features extracted from a convolutional neural network provided the best results. The similarity retrieval showed to be the weakest part, and the entire system together provides an average accuracy of 83.99%.
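The evaluation measures used above (precision, recall and accuracy) follow directly from the binary confusion counts; a small self-contained sketch with made-up labels (the example labels are assumptions, not the thesis's data):

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and accuracy for a binary selection task."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

p, r, a = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(round(p, 2), round(r, 2), round(a, 2))  # 0.67 0.67 0.67
```

With two binary classifiers in the pipeline, each is scored with these measures separately before the clustering stage is evaluated end to end.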
APA, Harvard, Vancouver, ISO and other styles
49

Jensen, Richard. "Combining rough and fuzzy sets for feature selection". Thesis, University of Edinburgh, 2004. http://hdl.handle.net/1842/24740.

Full text
Abstract (summary):
Feature selection (FS) refers to the problem of selecting those input attributes that are most predictive of a given outcome; a problem encountered in many areas such as machine learning, pattern recognition and signal processing. Unlike other dimensionality reduction methods, feature selectors preserve the original meaning of the features after reduction. This has found application in tasks that involve datasets containing huge numbers of features (in the order of tens of thousands), which would be impossible to process further. Recent examples include text processing and web content classification. FS techniques have also been applied to small and medium-sized datasets in order to locate the most informative features for later use. Many feature selection methods have been developed and are reviewed critically in this thesis, with particular emphasis on their current limitations. The leading methods in this field are presented in a consistent algorithmic framework. One of the many successful applications of rough set theory has been to this area. The rough set ideology of using only the supplied data and no other information has many benefits in FS, where most other methods require supplementary knowledge. However, the main limitation of rough set-based feature selection in the literature is the restrictive requirement that all data is discrete. In classical rough set theory, it is not possible to consider real-valued or noisy data. This thesis proposes and develops an approach based on fuzzy-rough sets, fuzzy rough feature selection (FRFS), that addresses these problems and retains dataset semantics. Complexity analysis of the underlying algorithms is included. FRFS is applied to two domains where a feature reducing step is important; namely, web content classification and complex systems monitoring. The utility of this approach is demonstrated and is compared empirically with several dimensionality reducers. 
In the experimental studies, FRFS is shown to equal or improve classification accuracy when compared to the results from unreduced data. Classifiers that use a lower dimensional set of attributes which are retained by fuzzy-rough reduction outperform those that employ more attributes returned by the existing crisp rough reduction method. In addition, it is shown that FRFS is more powerful than the other FS techniques in the comparative study. Based on the new fuzzy-rough measure of feature significance, further development of the FRFS technique is presented in this thesis. This is developed from the new area of feature grouping that considers the selection of groups of attributes in the search for the best subset. A novel framework is also given for the application of ant-based search mechanisms within feature selection in general, with particular emphasis on its employment in FRFS. Both of these developments are employed and evaluated within the complex systems monitoring application.
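For context, the crisp rough-set reduction that FRFS generalises can be sketched as a greedy QuickReduct search over the dependency degree. The toy data below is an illustrative assumption, and this crisp version requires discrete attributes, which is exactly the limitation that motivates the fuzzy-rough extension.

```python
def dependency(data, labels, attrs):
    """Rough-set dependency degree: fraction of objects whose equivalence
    class (under attrs) maps to a single decision label."""
    if not attrs:
        return 0.0
    groups = {}
    for row, lab in zip(data, labels):
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, set()).add(lab)
    consistent = sum(1 for row, lab in zip(data, labels)
                     if len(groups[tuple(row[a] for a in attrs)]) == 1)
    return consistent / len(data)

def quickreduct(data, labels):
    """Greedy QuickReduct: repeatedly add the attribute with the largest
    dependency gain until the full-attribute dependency is matched."""
    n_attrs = len(data[0])
    target = dependency(data, labels, list(range(n_attrs)))
    reduct = []
    while dependency(data, labels, reduct) < target:
        best = max((a for a in range(n_attrs) if a not in reduct),
                   key=lambda a: dependency(data, labels, reduct + [a]))
        reduct.append(best)
    return reduct

# Toy discrete data: attribute 1 alone determines the label
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 0, 1]
print(quickreduct(data, labels))  # [1]
```

FRFS replaces the crisp equivalence classes above with fuzzy similarity classes, so real-valued attributes contribute graded, rather than all-or-nothing, dependency.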
APA, Harvard, Vancouver, ISO and other styles
50

Nilsson, Roland. "Statistical Feature Selection : With Applications in Life Science". Doctoral thesis, Linköping : Department of Physcis, Chemistry and Biology, Linköping University, 2007. http://www.bibl.liu.se/liupubl/disp/disp2007/tek1090s.pdf.

Full text
APA, Harvard, Vancouver, ISO and other styles

Go to the bibliography