A Novel Classifier Based on Genetic Algorithms and Data Importance Reformatting

Author

Alkhayyat, Aysha Khaled

Linked Agent

Hewahi, Nabil, Thesis advisor

Date Issued

2023

Language

English

Extent

[1], 12, 65, [14] pages

Place of institution

Sakhir, Bahrain

Thesis Type

Thesis (Master)

English Abstract

Abstract: Machine learning (ML) has attracted substantial attention and grown steadily more popular in recent years because of its ability to make predictions and decisions based on data. The increasing use of ML in critical fields such as medical diagnosis and fraud detection emphasizes the need to continuously enhance its performance and, ultimately, to support better decision-making. This can be achieved through the use of optimization methods (OM). However, the performance of ML algorithms is sometimes limited by issues related to the nature of the data. Therefore, this research proposes GADIC, a novel classification algorithm based on Data Importance (DI) reformatting and Genetic Algorithms (GA), to overcome these issues and improve the efficacy and robustness of ML algorithms. The aim of this research is to evaluate the impact of the proposed algorithm on classifier performance and to compare it with conventional classification algorithms in order to measure its ability to improve that performance. The proposed algorithm comprises three phases: a data reformatting phase, which depends on the DI concept; a training phase, in which GA is applied to the reformatted training set; and a testing phase, in which the instances of the reformatted testing set are averaged based on similar instances in the training set. The proposed algorithm was tested on five existing ML classifiers: Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Logistic Regression (LR), Decision Tree (DT), and Naïve Bayes (NB). All were evaluated using seven open-source datasets from the UCI ML repository and Kaggle: Cleveland heart disease, Indian liver patient, Pima Indian diabetes, employee future prediction, telecom churn prediction, bank customer churn, and tech students.
In terms of accuracy, the results showed that, with the exception of an approximately 1% decrease in the accuracy of the NB classifier on the Cleveland heart disease dataset, GADIC significantly enhanced the performance of most ML classifiers across the datasets. In addition, KNN with GADIC showed the greatest performance gain among the ML classifiers with GADIC, with an average increase of 16.79%, followed by SVM with an average increase of 9.03%. LR had the lowest improvement, with an average increase of 5.96%.

Member of

College of Science

Identifier