Performance Evaluation of User-Behaviour Techniques of Web Spam Detection Models

Web spam detection is a critical issue in today's rapidly growing usage of the Internet and the World Wide Web. The upsurge of web spam has significantly deteriorated the Quality of Service (QoS) of the World Wide Web. The degeneration of the quality of search engine results has given rise to research on detecting spam pages efficiently and accurately. Existing user-behaviour oriented web spam detection models employed the content-based, link-based and other features of webpages for the classification of web spam. These user-behaviour techniques, whether implemented singly or in combination, have achieved good detection performance. However, the effectiveness of these features in identifying web spam correctly needs to be determined. In this study, predictive web spam detection models that employed all related user-behaviour features of webpages were developed and evaluated. The content-based, link-based, and obvious-based feature datasets were collected from an online repository. Relevant features were extracted using an improved filter-based method. Six user-behaviour related features extracted from the datasets were used as a basis for combining the datasets into all possible subsets of the feature space, such that 7 new datasets were generated for the study. The Multi-Layer Perceptron (MLP) approach was adopted as a classifier for each of the identified feature sets. The Python machine learning library was used to simulate the models using percentage splits of 60/40, 70/30 and 80/20 for the training/testing datasets, and the performances were evaluated using accuracy, True Positive (TP) rate, False Positive (FP) rate and precision as metrics. The results showed that for the majority of the datasets, the formulated models increased in efficiency after feature selection. The MLP classifier achieved the best result of 66.0% accuracy when the link-based dataset was used with feature selection.
The study concluded that the link-based features of a user are sufficient and effective for the detection of web spam.


Introduction
Web spam comprises unsolicited, unwanted emails, ads, links and contents sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient, or unsolicited commercial mail usually sent to a large group of recipients at the same time through service providers such as Internet Service Providers (ISPs) (Ndumiyana et al., 2013). Web spam usually pops up during the search for information on the web; these ads or junk pages are developed by spammers to attract web users and cause a search engine to produce wrong information when surfing the web or websites (Jindal and Liu, 2008). The nefarious acts posed by web spam include subverting the ranking algorithms of web search engines, causing them to rank spam results higher than they otherwise would (Najork, 2009).
This spam situation is so disruptive and infuriating that search engines, web users, and email receivers spend a lot of time trying to combat it, since it leads to significant losses of resources, time, and money. Commercial search engines treat their precise sets of spam-prediction features as extremely proprietary, and features (as well as spamming techniques) evolve continuously as search engines and web spammers engage in a continuing arms race (Manne and Wright, 2011).
Different classification problems amenable to machine learning techniques have been posed to combat this web menace. Spam classifiers and filters have been created to combat web spam, but spammers always develop new ways to manipulate their way into search engines. Spam classifiers and filters take a large set of diverse features as input, including content-based features, link-based features, DNS and domain-registration features, and implicit user feedback. However, it was observed that the majority of the classifiers focused on content and/or link-based features of the webpages for detecting web spam. Many of these algorithms are not proactive and cannot withstand the pressure from spam pages: even if spam links, emails, or webpages are blocked by the filters or classifiers the first time, the spam soon replicates itself and attacks the algorithm as many times as needed until it makes its way through.
Meanwhile, web users' interests, navigational actions, and preferences have gained importance in web spam detection, since web users contribute greatly to the sharing of spam pages and spam sites, and to making spam pages or sites gain more relevance. Most users who visit the web are not aware of spam pages or sites; for this reason their navigation pattern is always a problem, as they click on every page that comes up with an interesting topic. The navigational patterns of a user are stored within web access logs, and these files contain many different kinds of data. To understand user behaviour through the data stored in web access logs, some of the existing studies reviewed the extraction of user-behaviour data from webpages and web access log files but adopted the use of expert knowledge for the classification of web spam, while others employed the web usage mining concept, which consists of the application of machine learning techniques over data originated on the Web (Web data) for the automatic extraction of behavioural patterns of Web users (Román et al., 2014).
Implementing a web spam detection model using user behaviour-related data has greatly helped improve the prevention of illegal and unsolicited access to spam webpages. The availability of web access log files provides a means of understanding the interaction of users with webpages to ascertain information about the status of the webpages visited. Concerning user behaviour analysis, there have been several challenges in combating web spam in search engines and online social networks, among them the challenge of detecting newly appearing web spam, called zero-day spam. For this reason, users of the World Wide Web still frequently encounter embarrassment from spam pages, ads, pop-ups, link redirections, false news, pornographic sites, etc. In the end, many people may not get what they are searching for from the web, due to the deceitful nature of the spamming community. Thus, there is a need to determine which of the user-behaviour techniques is effective for detecting newly appearing web spam.
In this study, an attempt was made to evaluate the effectiveness and robustness of a web spam detection model that employed the existing content, link, and obvious-based features complemented with other user-behaviour features. The model with the best performance will be selected for the detection of web spam.
The rest of the paper is organized as follows: related works are discussed in Section 2, while Section 3 discusses the methodology used to solve the identified problem. Section 4 discusses the simulation process and results. Section 5 offers conclusions and recommendations for future work.

Related Works
Research on web spam detection has been ongoing for over a decade. Several web spam detection algorithms have been developed to identify the different types of spam that appear on the Web. Spirin and Han (2011), Castillo and Davison (2011), and Kohle and Bhukte (2015) gave comprehensive surveys on the state of the art of the different techniques used for web spam detection. With respect to user behaviour analysis, web spam is detected by analysing features of the webpage contents, the hyperlink structure, or other behavioural signals. Some studies adopted the use of content-based features to detect web spam (Fetterly et al., 2004; Gyongyi and Garcia-Molina, 2005; Mishne, 2005; Ntoulas et al., 2006; Svore, 2007; Sydow, 2007; Liu et al., 2008; Piskorski, 2008; Erdélyi et al., 2009; Awad and Elseoufi, 2011; Erdélyi, 2011; Iqbal and Abid, 2015; Rao et al., 2016; Al-Zoubi et al., 2017). Some studies adopted hyperlink structure analysis (Davison, 2000; Amitay et al., 2003; Caverlee and Liu, 2007; Benczúr et al., 2006; Baeza-Yates et al., 2006; Becchetti et al., 2006; Niu et al., 2018; Hochbaum et al., 2019). Some adopted combinations of content-based and link-based features for web spam detection (Gyongyi et al., 2004; Wu and Davison, 2006; Castillo et al., 2007; Liu et al., 2011), while others adopted features such as click-based and posting-based features for spam URL detection (Wei et al., 2012; Cao and Caverlee, 2015). All these user-behaviour oriented algorithms showed competitive detection performance, maximizing accuracy and minimizing false positive rates, which demonstrates the successful results of many researchers. Despite all these successes, spam is still constantly evolving and still negatively affects many people and businesses. Thus, there is a need to evaluate the effectiveness of the user-behaviour features to identify which one works best in detecting zero-day spam.
Figure 1 shows the conceptual diagram of the study, and the methods adopted to achieve the objectives of the study are as follows:

Methodology
(a) Collection of data from an online repository.
(b) Identification and analysis of the web-usage features and the user-behaviour features required for assessing web-pages.
(c) Formulation of the spam detection model for each of the identified features in the dataset.
(d) Simulation of the models developed from these datasets using percentage splits of 60/40, 70/30 and 80/20 for the training/testing sets selected from the data collected.
(e) Performance evaluation of the models using accuracy, true positive (TP) rate, false positive (FP) rate and precision to select the most appropriate user behavior technique for web spam detection.
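The four metrics in (e) can all be computed from the counts of a binary confusion matrix. A minimal sketch follows, assuming "spam" is the positive class; the function name and the example labels are illustrative, not from the study's dataset.

```python
# Sketch: deriving accuracy, TP rate, FP rate and precision from
# predicted vs. actual labels, with "spam" as the positive class.

def evaluate(actual, predicted, positive="spam"):
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(actual),
        "tp_rate": tp / (tp + fn) if tp + fn else 0.0,  # recall/sensitivity
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Hypothetical labels purely for illustration.
actual    = ["spam", "spam", "nonspam", "nonspam", "spam"]
predicted = ["spam", "nonspam", "nonspam", "spam", "spam"]
metrics = evaluate(actual, predicted)
```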

Data Collection and Analysis
A web spam dataset called the UK-2006-Web-Challenge Data was collected from the Web Spam Challenge website. It contained 3 classes of web-usage feature-based datasets, namely: content-based features, link-based features and obvious features. The distribution of the target class used to describe each host that was assessed is as shown in Table 1.

Features Extraction Process
Filter-based feature selection (FS) methods were employed to determine the relevant features from the datasets. FS methods define relevance by identifying the attributes that are most correlated with the target class (spam or non-spam), and they are also less computationally expensive. A backward elimination technique was applied, which began by selecting all of the initially identified features in the dataset and evaluating their accuracy using a classifier. The process is depicted in Figure 2. This process was repeated for every possible subset of features in the dataset by progressively eliminating features until the feature set was empty. The feature set evaluated with the highest accuracy is returned as the most relevant set of features required to improve the performance of the proposed spam detection model.
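The elimination loop just described can be sketched as follows. The `toy_score` function is a hypothetical stand-in for "train the classifier on this feature subset and measure its accuracy"; any real evaluator with the same signature could be substituted.

```python
# Sketch of backward elimination: start from the full feature set,
# repeatedly drop the single feature whose removal scores best, and
# remember the best-scoring subset seen along the way.

def backward_elimination(features, score):
    best_subset, best_score = list(features), score(features)
    current = list(features)
    while current:
        # All subsets obtained by removing exactly one feature.
        candidates = [[f for f in current if f != d] for d in current]
        scored = [(score(c), c) for c in candidates]
        s, current = max(scored, key=lambda t: t[0])
        if s >= best_score:
            best_score, best_subset = s, current
    return best_subset, best_score

relevant = {"f1", "f3"}  # hypothetical "truly useful" features
def toy_score(subset):
    # Stand-in for classifier accuracy: reward relevant features,
    # lightly penalise subset size. Must accept the empty set.
    return len(set(subset) & relevant) - 0.1 * len(subset)

subset, s = backward_elimination(["f1", "f2", "f3", "f4"], toy_score)
```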
The datasets containing the 3 web-usage feature classes were combined to generate the seven (7) possible combinations of the datasets; the feature classes were used as the basis for combining the datasets to generate all possible subsets of the feature space required for model development. The datasets were combined such that 3 datasets consisted of content-based only, obvious-based only and link-based only features; 3 datasets consisted of content and link-based, content and obvious-based, and obvious and link-based features; and 1 dataset consisted of content-based, link-based and obvious features, as shown in Table 2, making a total of 7 datasets adopted for this study. Figure 3 shows a description of the seven (7) feature class-based datasets generated from the three (3) datasets collected.
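The seven datasets are exactly the non-empty subsets of the three feature classes (2³ − 1 = 7). A minimal sketch of the enumeration:

```python
# Enumerate all non-empty subsets of the three web-usage feature
# classes: 3 singletons + 3 pairs + 1 triple = 7 datasets.
from itertools import combinations

classes = ["content", "link", "obvious"]
datasets = [list(c)
            for r in range(1, len(classes) + 1)
            for c in combinations(classes, r)]
```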
Also, the user-behaviour features proposed by Liu et al. (2012) were extracted from the initial web-usage features collected. They are the six user-behaviour features that can be adopted for separating spam pages from ordinary ones: the first five are derived from user-behaviour patterns, while the last is a link-analysis feature extracted from a user-browsing graph that is also constructed from users' web access log data. Table 2 shows the description of the classification system that was used to identify each user-behaviour feature adopted for web spam classification.

Model Formulation
The predictive model adopted for this study was the Multi-Layer Perceptron (MLP) as described by Idowu et al. (2019). The MLP is composed of three (3) main layers, namely: the input, hidden and output layers. The MLP consisted of n input neurons, proportional in number to the features identified in the dataset presented. The hidden layer consisted of neurons which received input from the input layer and produced outputs via an activation function; these outputs were propagated to the neurons in subsequent layers. The output layer consisted of two (2) neurons which represented the target classes for identifying spam and non-spam web hosts from the dataset.
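The layer structure just described can be sketched as a single forward pass. The weights, biases and input values below are illustrative placeholders, not trained parameters from the study, and the sigmoid activation is one common choice among several.

```python
# Sketch of one forward pass through an MLP with n inputs, one hidden
# layer, and a two-neuron output layer (one neuron per target class).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    # Each neuron: weighted sum of its inputs plus a bias, passed
    # through the activation function.
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = layer(x, w_hidden, b_hidden)
    return layer(hidden, w_out, b_out)  # scores for [non-spam, spam]

x = [0.5, 0.2, 0.9]  # three hypothetical feature values
w_hidden = [[0.4, -0.6, 0.1], [0.3, 0.8, -0.5]]
b_hidden = [0.0, 0.1]
w_out = [[0.7, -0.3], [-0.7, 0.3]]
b_out = [0.05, -0.05]
scores = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
```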
The formulated predictive model for web spam detection using the features selected is presented in Figure 4. A mapping function was used to express the process of model formulation from the feature space to the output space. The training dataset S, which consisted of the initial features identified at the point of data identification and collection, is represented by S = {f1, f2, ..., fi}, where i is the number of features existing in the original dataset of web hosts, and S' ⊆ S consists of the features relevant for predicting web spam, such that S' = {f1, f2, ..., fm}, m ≤ i. The process of feature selection is represented by the mapping function F in equation (6):

F : S → S'    (6)

Such that: S are the original set of attributes collected and S' are the relevant features selected by the feature selection method. Following the process of feature selection, each record of the new dataset belongs to S' × Y, such that k indexes the web-host records collected in the original dataset. The n datasets selected for training the predictive model using supervised machine learning were used to formulate the model over the relevant variables via the mapping in equation (7):

h : Xk → yk    (7)

where Xk = (x1k, x2k, ..., xmk) represented the set of attributes j for web-host record k, and yk is the target class (spam or non-spam) of record k. Therefore, the supervised machine learning algorithm considered in this study was expected to determine the best fit h* in H (the set of all possible models) based on the minimization of the cost function defined according to equation (8):

C(h) = (1/n) Σk L(yk, ŷk)    (8)

Such that: yk and ŷk = h(Xk) are the actual and predicted values of the target class respectively. Since the problem is a classification problem, a loss value of 0 implied a correct classification while a value of 1 implied an incorrect classification. Hence, the ability to classify the detection of web spam correctly is determined according to the 0-1 loss defined in equation (9):

L(yk, ŷk) = 0 if yk = ŷk; 1 otherwise    (9)
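The 0-1 cost described by equations (8) and (9) amounts to averaging a per-record loss that is 0 when the prediction is correct and 1 otherwise. A minimal sketch, with hypothetical labels (1 = spam, 0 = non-spam):

```python
# Sketch of the 0-1 cost: per-record loss is 0 for a correct
# prediction and 1 for an incorrect one; the cost is their average.

def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

def cost(actual, predicted):
    return sum(zero_one_loss(y, p) for y, p in zip(actual, predicted)) / len(actual)

# One misclassification out of four records -> cost 0.25.
c = cost([1, 0, 1, 1], [1, 0, 0, 1])
```

Minimizing this cost over the candidate models is equivalent to maximizing classification accuracy, which is why accuracy is the primary evaluation metric in the results below.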

Results and Discussion
The detailed results of this study are as follows:

Data Analysis results
The data collected contained 3999 records of webpages which were assessed as spam and non-spam pages based on the user-behaviour scores alongside the features identified from the 3 classes of datasets collected. The descriptions of the feature sets of the 3 datasets are omitted here for brevity. The dataset was used as a basis for extracting six (6) user behaviour-related features for the classification of each web host as either spam or non-spam. This approach was based on the principle of the Wisdom of the Crowds, which focuses on using the interaction of users with webpages to determine the nature of the webpages visited based on the characteristics of the six (6) user-behaviour features identified. The results of the classification of the three (3) datasets collected for this study are presented in Table 3.

Features Selection Result
The process of feature selection was repeated for all 7 datasets, with the most relevant feature set for each dataset identified as shown in Table 4. Table 5 shows the number of features in each dataset alongside the number and proportion of features selected. The results showed that 49.5%, 50.7%, 33.3%, 49.6%, 49.5%, 49.3% and 49.6% of the features were selected for the content-based, link-based, obvious-based, content and link-based, content and obvious-based, link and obvious-based, and content, link and obvious-based datasets respectively.

Model Simulation and Evaluation Result
The proposed model was simulated using the Python machine learning library. The seven (7) datasets were split into training and testing datasets according to proportions of 60%/40%, 70%/30%, and 80%/20%. Table 6 shows the distribution of the target classes (spam and non-spam) within each proportion of the training and testing datasets collected for this study. Using the full feature set and the selected features of the 7 datasets, a total of 14 datasets were subjected to the 60/40, 70/30 and 80/20 training and testing schemes. Thus, 42 simulations were carried out, 14 for each training scheme. The results of the simulations are presented for each training and testing scheme as follows:
(i) Using the 60/40 percentage scheme
Table 7 and Figure 5 present the results of the simulation of the proposed classification model using the 60/40 percentage split scheme performed on the 7 datasets using the full feature set and the relevant selected feature set, with 188 records in the testing dataset. The MLP classification model for web spam detection had the best classification result using the link-based dataset with the relevant features selected via the feature selection process.
(ii) Using the 70/30 percentage scheme
Table 8 and Figure 6 present the results of the simulation of the proposed classification model using the 70/30 percentage split scheme performed on the 7 datasets using the full feature set and the relevant selected feature set, with 141 records in the testing dataset. The MLP classification model for web spam detection had the best classification result using the link-based dataset with the originally identified features (without FS).

(iii) Using the 80/20 percentage scheme
Table 9 and Figure 7 present the results of the simulation of the proposed classification model using the 80/20 percentage split scheme performed on the 7 datasets using the full feature set and the relevant selected feature set, with 94 records in the testing dataset. The MLP classification model for web spam detection had the best classification result using the link and obvious-based dataset with the originally identified features (without FS).
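The testing-set sizes reported for the three schemes (188, 141 and 94 records) correspond to 40%, 30% and 20% of 470 records. A minimal sketch of such a percentage split follows; the seed value and the use of integer record ids are illustrative assumptions, not details from the study.

```python
# Sketch of a percentage split: shuffle the records once
# (reproducibly, via a fixed seed), then cut at the chosen ratio.
import random

def percentage_split(records, train_fraction, seed=42):
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(470))                     # hypothetical record ids
train, test = percentage_split(records, 0.60)  # the 60/40 scheme
# len(train) == 282, len(test) == 188, matching the reported testing size
```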
From the evaluation results, the model with the best overall performance was identified as the multi-layer perceptron (MLP) classifier modelled using the dataset containing the initially identified link-based features alone. Figure 8 shows the TP rate, FP rate and precision for the non-spam (ham) and spam web-hosts for the MLP classifiers with the best performance under each percentage-split training scheme. The results showed that the MLP classifier using the 70/30 percentage split scheme (without FS) had the highest TP rate for non-spam web-hosts and a moderate TP rate for spam web-hosts. The results also showed that the MLP classifier using the 70/30 percentage split (without FS) had a moderate FP rate for non-spam web-hosts and the lowest FP rate for spam web-hosts.
In summary, the results of the study showed that the feature selection algorithm adopted for this study selected about 50% of the initially identified features as relevant. The simulation results showed that the MLP classification model for web spam detection had the best classification performance using the link-based dataset with the relevant features selected under the 60/40 percentage split, the link-based dataset with the originally identified features under the 70/30 percentage split, and the link and obvious-based dataset with the originally identified features under the 80/20 percentage split. The accuracy of the MLP with and without feature selection using the link and obvious-based dataset was 63.8%, while the content and obvious-based dataset showed an accuracy of 48.9% and the content and link-based dataset an accuracy of 55.3%. Using the three combined feature classes, consisting of the content, link and obvious-based datasets, with and without feature selection showed an accuracy of 56.4%. Using the isolated feature datasets for simulation by the MLP with and without feature selection, the content-based dataset showed an accuracy of 56.4%; the link-based dataset showed an accuracy of 61.7%, which was better than the content-based dataset; and the obvious-based dataset showed an accuracy of 58.5%, which also outperformed the content-based dataset.

Conclusion
This study evaluated the accuracy performances of different user-behaviour features used in modelling web spam detection models. The study identified the user-behaviour features from a selected dataset and then developed a classification model for the detection of spam websites using the Multi-Layer Perceptron (MLP). The classification models were simulated and validated. The results showed that the MLP classifier using the 70/30 percentage split scheme (without FS) had the highest precision for both non-spam and spam web-hosts. The study showed that link-based features integrated into a classification model facilitate more effective detection of spam websites compared to content-based features. This is justified since content-based features only contain details about the content of the webpages, whereas link-based features reveal information about the connection of pages alongside the navigation of users through these paths. Therefore, users are likely to spend less time on such a website, which also implies less movement through the paths created by its hyperlinks. This ensures that unsuspecting users are not directed to the contents of such websites, thus mitigating the risk associated with visiting spam websites. The timeliness problem is still a challenge; suggested future work includes evaluating the time factor of detecting spam at an early stage using the identified user-behaviour features.