UTAR Institutional Repository

Feature selection by mutual information: robust ranking on high-dimension low-sample-size data

Chin, Fung Yuen (2024) Feature selection by mutual information: robust ranking on high-dimension low-sample-size data. PhD thesis, UTAR.

    Abstract

    Feature selection is the process of choosing a group of relevant features, and discarding unnecessary ones, for use in constructing a predictive model. The current benchmark for a data set is obtained by including all of its features, redundant and noisy features among them. This research therefore proposes an optimal baseline for the data set using a feature ranking method. The number of features that achieves this optimal baseline is obtained at the same time and serves as a guideline for the number of features a feature selection method needs. In addition, high-dimensional data makes feature selection more difficult because of the curse of dimensionality. To overcome this problem, a robust feature selection algorithm named ranked mutual information with support vector machine (rMI-SVM) is proposed. It can be applied to data with missing values regardless of the linearity of the data set, and it requires no additional parameter or preset number of features. The features selected by rMI-SVM can avoid overfitting, as each chosen candidate feature provides new information to the predictive model. The receiver operating characteristic curve was plotted to show the sensitivity of the model built by rMI-SVM compared with the regression method under the same number of features. The Z-score graph was also plotted to confirm that the features chosen by rMI-SVM were not selected by chance. The experimental results show that the proposed method can select a compact subset of features that performs better than both the benchmark of the data set and the optimal baseline proposed in this study. The biological meaning of the selected features confirms that they are related to the relevant disease.
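
The ranking idea behind the abstract can be illustrated with a minimal sketch. It is not the thesis's rMI-SVM algorithm, whose details are not given on this page. It only scores discrete features by their empirical mutual information with the class label and sorts them. The function names `mutual_information` and `rank_features` and the toy data are assumptions made for illustration, and the SVM evaluation step is omitted.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of feature values
    py = Counter(ys)            # marginal counts of labels
    pxy = Counter(zip(xs, ys))  # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_features(rows, labels):
    """Return (feature_index, MI) pairs, most informative feature first."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((j, mutual_information(column, labels)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration: 8 samples, 3 binary features.
# Feature 0 copies the label, feature 1 agrees with it in 6 of 8 samples,
# and feature 2 is independent of it.
rows = [
    (0, 0, 0),
    (0, 0, 1),
    (0, 0, 0),
    (0, 1, 1),
    (1, 1, 0),
    (1, 1, 1),
    (1, 1, 0),
    (1, 0, 1),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

ranking = rank_features(rows, labels)
for index, score in ranking:
    print(f"feature {index}: MI = {score:.3f} bits")
```

In a full pipeline, a classifier such as a support vector machine would then be trained on the top-ranked features. How rMI-SVM chooses the number of features and handles missing values is described in the thesis itself, not here.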

    Item Type: Final Year Project / Dissertation / Thesis (PhD thesis)
    Subjects: H Social Sciences > HA Statistics
    Q Science > Q Science (General)
    Divisions: Institute of Postgraduate Studies & Research > Lee Kong Chian Faculty of Engineering and Science (LKCFES) - Sg. Long Campus > Doctor of Philosophy in Science
    Depositing User: Sg Long Library
    Date Deposited: 19 Jan 2025 09:59
    Last Modified: 19 Jan 2025 09:59
    URI: http://eprints.utar.edu.my/id/eprint/7067
