Toward Developing a Mechanism for Detecting Arabic Spam on Twitter Platform
Linked agent
Mahmood, Amjad, Thesis advisor
Language
English
Extent
12, 93, [5] pages
Subject
Place of institution
Sakhir, Bahrain
Thesis type
Thesis (Master)
Granting institution
UNIVERSITY OF BAHRAIN, College of Information Technology
Description
Abstract
The popularity of Twitter in the Arab world has unfortunately been exploited to disseminate tailored spam tweets targeting Arabic-speaking users. These spam tweets are annoying and degrade the user experience on a platform designed to carry valuable, real-time tweets generated by legitimate users. The problem becomes more acute when spam tweets are hashtagged with a trending topic, because every Twitter user browsing that topic is then exposed to the spam.
In this research, we propose detecting spam tweets by analysing the tweet structure and mining the text for spam words and phrases. To evaluate the effectiveness of this approach, we treated spam detection as a binary classification problem and addressed it using supervised Machine Learning (ML) classifiers. We first collected and manually labeled a dataset of 14,849 Arabic tweets, all tagged with topics trending in the Arabian Gulf countries. We then preprocessed these labeled tweets using natural language processing techniques and extracted relevant features to distinguish between spam and non-spam tweets. Among these features are spam keywords, which were extracted from the spam tweets according to their term frequency times inverse document frequency (tf.idf) values using different n-gram representation models. Furthermore, to determine which representation of these features best detects spam, we investigated the impact of feature dimensionality under different representation models on the performance of four standard ML classifiers, namely Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbor (KNN) and Random Forest (RF). In addition, we evaluated the performance of the classifiers with 10-fold cross-validation and on an out-of-sample dataset. We also stressed the importance of applying stemming to Arabic text before mining it, and examined the impact of an even distribution of the categories of interest when training and testing the classifiers.
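The keyword-extraction step described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw term counts for tf, and idf = log(N / df); the thesis may use a different weighting, and the function names and toy English tweets are hypothetical placeholders, not the author's code or data.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_scores(docs, n=1):
    """Score each n-gram of each document by tf * idf.

    tf  = raw count of the n-gram in the document
    idf = log(N / df), with df the number of documents containing the n-gram
    (one common variant, assumed here for illustration).
    """
    tokenized = [ngrams(doc.split(), n) for doc in docs]
    df = Counter()
    for terms in tokenized:
        df.update(set(terms))  # count each n-gram once per document
    total = len(docs)
    scores = []
    for terms in tokenized:
        tf = Counter(terms)
        scores.append({t: c * math.log(total / df[t]) for t, c in tf.items()})
    return scores

def top_keywords(docs, n=1, k=3):
    """Rank n-grams by their tf.idf summed over all documents; return the top k."""
    totals = Counter()
    for doc_scores in tfidf_scores(docs, n):
        totals.update(doc_scores)  # adds the per-document scores together
    return [t for t, _ in totals.most_common(k)]

# Hypothetical toy corpus standing in for preprocessed (stemmed) tweets.
tweets = ["free free prize", "free click", "news today"]
print(top_keywords(tweets, n=1, k=1))
print(ngrams("free prize click".split(), 2))
```

Changing `n` switches between unigram, bigram, and higher-order representations, which is the kind of comparison the abstract refers to as different n-gram representation models.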
The experimental results demonstrate the effectiveness of the proposed methodology in distinguishing spam from non-spam tweets, even on an out-of-sample dataset. They suggest that SVM and RF perform the task better than DT and KNN, both under 10-fold cross-validation and on the out-of-sample dataset, and across different experimental settings. On the out-of-sample dataset, SVM detects spam tweets with a recall of 0.59, a precision of 0.91 and an F1-measure of 0.71, while RF achieves a recall of 0.65, a precision of 0.67 and an F1-measure of 0.66. RF has the advantage of being faster than SVM at test time, which is an important consideration for on-the-spot detection of spam tweets.
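For readers unfamiliar with the reported metrics, the following sketch shows how precision, recall and F1 relate for the spam class. The function names are illustrative, and the confusion-matrix counts in the first call are made up; only the precision and recall values passed to `f1_from` come from the abstract.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive (spam) class.

    tp: spam tweets correctly flagged
    fp: legitimate tweets wrongly flagged as spam
    fn: spam tweets that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall)
    return precision, recall, f1

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only.
print(precision_recall_f1(tp=8, fp=2, fn=2))

# F1 recomputed from the rounded precision and recall quoted in the abstract.
print(round(f1_from(0.91, 0.59), 3))  # SVM: 0.716
print(round(f1_from(0.67, 0.65), 3))  # RF:  0.66
```

The value recomputed for SVM (about 0.716) differs slightly from the reported 0.71; this is presumably because the reported figure was derived from unrounded precision and recall, which is an assumption rather than something the abstract states.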
Note
Title on cover:
تطوير طريقة للكشف عن التغريدات العربية التطفلية
Collection
Identifier
https://digitalrepository.uob.edu.bh/id/2cef4697-0dd2-4622-9ece-621f79e2ce7e