English Abstract
Abstract:
Decision trees have been applied widely for classification in fields such as finance, marketing, engineering, and medical diagnosis. Given this wide range of applications, it is crucial to understand the various aspects of decision trees, including the different types of cost associated with the classification task. It is equally important to understand the relationship between a classifier's accuracy and its cost, since balancing the two is a major concern in many fields, medical diagnosis among them.
This research explores how to resolve the accuracy/cost trade-off in a decision tree classifier in order to obtain a classifier that is both low-cost and accurate. It aims to model the relationship between classification accuracy and cost in a decision tree and to determine the balance between the two that gives satisfactory accuracy at the lowest possible cost.
In this research, different pruning methods were used to control the level of tree pruning and improve the classification accuracy of a decision tree classifier. Pruning parameters have to be chosen carefully, as they may lead to noticeable differences in the classifier's performance. Therefore, this research tested various pruning parameter options in the standard decision tree algorithm, and then used the Pareto dominance approach to select the best setting, as it is known for its power and effectiveness in addressing different data mining problems.
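The Pareto dominance selection described above can be illustrated with a minimal sketch. It assumes each pruning setting is summarized by two numbers, its accuracy (higher is better) and its average total classification cost (lower is better). The setting names and values below are hypothetical placeholders, not results from this research, and the helper names `dominates` and `pareto_front` are introduced here only for illustration.

```python
from typing import NamedTuple


class Setting(NamedTuple):
    name: str        # hypothetical label for a pruning configuration
    accuracy: float  # classification accuracy, higher is better
    cost: float      # average total classification cost, lower is better


def dominates(a: Setting, b: Setting) -> bool:
    """Return True if a is no worse than b on both criteria and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(settings: list[Setting]) -> list[Setting]:
    """Keep only the settings that no other setting dominates."""
    return [s for s in settings if not any(dominates(o, s) for o in settings)]


# Hypothetical (accuracy, cost) pairs, for illustration only.
candidates = [
    Setting("default", 0.94, 10.0),
    Setting("light_pruning", 0.95, 9.0),
    Setting("heavy_pruning", 0.92, 7.5),
    Setting("no_pruning", 0.93, 12.0),
]

front = pareto_front(candidates)
for s in front:
    print(s.name, s.accuracy, s.cost)
```

In this sketch the non-dominated settings form the candidate set. Choosing a single "best" setting from that set still requires a preference between accuracy and cost, for example the lowest cost among settings that meet a minimum acceptable accuracy.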
The experimental results indicate that different post-pruning and pre-pruning methods are effective in reducing the average total classification cost significantly, with only a very slight reduction in accuracy in some cases. The resulting relationship between the classification accuracy and the average total cost was inversely proportional. When the proposed solutions were compared with the standard decision tree algorithm under its default pruning settings, the reduction in average total classification cost reached 21.14%, 19.79%, and 5.27% on the breast cancer, heart disease, and thyroid disease datasets, respectively. Accuracy was reduced by only 1.6% on the breast cancer dataset, increased by 6.68% on the heart disease dataset, and was unchanged on the thyroid disease dataset.