Cyber-Security: The Use of Big Data Analytic Model for Network Intrusion Detection Classification

Cybersecurity plays a major role in protecting Internet-connected systems, including hardware, software and data, from cyberattacks and other malicious activity in today's densely connected world of the Internet of Things (IoT). The diverse challenges facing Internet users, both private and business entities, have not been enough to hinder the seamless interfacing of mobile computing and Internet applications now making waves. The application of Intrusion Detection Systems (IDS) to cybersecurity is an evolving computing mechanism designed as a countermeasure to incessant network threats and intruders. It is one of the most reliable proactive defensive tools and has gained significance over time. Meanwhile, the network traffic data generated by the enormous number of Internet users requires big data analytical tools for its analysis. This paper therefore applies big data analytical tools and their machine learning algorithms to an open-source data set, KDD'99. The full data set was used in the analysis. A predictive model was built in less than 5 minutes with 99.91% prediction accuracy. The computational challenge that limited previous research to only 10% of the data set was overcome. Therefore, IDS could be better designed by integrating the results of this classification model.


Many classical statistical and machine learning models have been proposed and implemented for intrusion detection analysis and system design. Results from these works on KDD'99 have also been surveyed [3], [8], [11].
[3] carried out a comparative analysis of various machine learning tools on the KDD'99 benchmark dataset. Two types of experiment were carried out in their paper: one with the full 41 attributes and the other with 11. The work claimed that a considerable reduction in resources was achieved using the 11 reduced features of the dataset. [12] provided a tabular review of well-known machine learning schemes. The work reveals some of the pros and cons of ML algorithms and fuzzy logic. They opined that it was difficult to choose one particular method over another for implementing intrusion detection. A more comprehensive survey on IDS was carried out in [8].
[1] applied the JRip and REPTree algorithms from WEKA (Waikato Environment for Knowledge Analysis) to the extracted KDD'99 dataset. User to Root (U2R) and Remote to Local (R2L) were the main attacks considered. The argument was that mining results mostly give less consideration to these two when they are combined with the remaining attacks in the full dataset. Meanwhile, these two attacks are considered the most dangerous [4]. The algorithms used performed better on the performance metrics and produced rules that can be implemented in a real IDS design.
An improved Naive Bayes algorithm based on Principal Component Analysis (PCA) was proposed by [13]. PCA was used to obtain a new set of attributes that serve as input to the Naive Bayes classifier. An improved weighted Naive Bayes classification was obtained, showing better performance for the approach adopted.
Another area of focus for IDS has been detection on a single event stream. In this regard, the system monitors either the network traffic directed to a server host or the access logs produced by a server application. With this approach, a stateful model has not been considered for analyzing different event streams and thereby providing an integrated stateful analysis of multiple event streams [7].
Another major area of concern is that several research works only considered a subset (10%) of the full KDD'99 dataset. The data set is 743 MB in size and contains 4,898,431 records, in addition to its testing data set, which many classical techniques might not be able to handle given the time and space complexity of their algorithms. Aside from this, the other challenge faced by most studies on intrusion detection classifiers was that training carried out on the small subset of the data did not represent the network patterns well enough. This makes it harder to distinguish attack patterns from normal ones, so more false positives are generated.
Big data analytics is a current trend in computing that serves as an umbrella for machine learning, statistical and visualization tools for large data sets. Characterized by its 5Vs (Volume, Velocity, Variety, Veracity and Value), it is employed as a solution to the problem of dealing with large and very large datasets. Coupled with cutting-edge advances in clustering and high-performance computing, a vast range of big data tools is available today, such as the Hadoop infrastructure, Spark, H2O, MapR and MapReduce. These tools integrate seamlessly with languages such as Julia, R, Python and Scala.
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." [14]. This paper therefore addresses the challenges mentioned earlier by relying on the promises of big data analytics, such as scalability, robustness, broad support for machine learning algorithms, in-memory parallel processing and support for massive datasets. In our model, we propose a Random Forests classifier implemented on the H2O platform running on R.

MACHINE LEARNING
Machine learning (ML), a subfield of computer science, has evolved over more than a decade with the promise of solving problems in Pattern Recognition, Natural Language Processing (NLP), Data Mining and Extraction, and Computational Theory in the world of Artificial Intelligence. The computer is given the ability to learn without being explicitly programmed. This involves constructing algorithms that can learn from hidden, interesting patterns and make useful predictions. These algorithms accept explanatory features as input, from which predicted outputs are generated.
ML has long been classified into supervised and unsupervised learning, with more recent exploration of reinforcement learning. Supervised learning learns from labeled training data to predict a target class, while unsupervised learning builds mining results (clustering or grouping of similar items) from unlabeled data, and reinforcement learning learns from reward feedback rather than labels. Figure 1.0 provides a simple overview of machine learning categories.
Most existing classical ML algorithms have limitations in the data samples they can deal with. Some work better when the attributes of the data sample share a uniform scale. Some handle categorical attributes well, while others perform better on small data samples.
Our approach in this work was to apply a modern machine learning algorithm, Random Forests, from a big data analytical perspective to the full KDD'99 dataset, thus overcoming the challenge of data scalability.

Dataset
The KDD Cup 1999 dataset was made available by NSL for the third international knowledge discovery and data mining tools competition [15] on intrusion detection. It has since become the benchmark for various research works on intrusion detection systems. The dataset was provided in different sizes, with both training and test sets (see Table 1.0). Many studies use only a subset of the data [16]. This is because most algorithms cannot scale up to the whole dataset, while some research has reported that 78% of training and 75% of testing records are duplicated. Such duplication biases learning against minority attacks such as U2R and R2L. Meanwhile, duplication of attack records could indicate that such attacks are more frequent than others. It has been noted, however, that frequent attacks are less dangerous than the less frequent ones, meaning that reasonable attention needs to be given to the less frequent ones too.
In this paper, we used the full data set, classifying connections into five categories including normal traffic (see Table 2.0), and addressed the problem of class imbalance by applying over-sampling to the minority attacks, since the majority attacks have a large number of instances in the dataset.

Data Preprocessing
The original full KDD'99 data set consists of 42 features, of which 41 are the explanatory attributes and the 42nd is the target class. The 41 attributes comprise continuous, nominal and categorical data values. Since attack types can be grouped into unique classes, and for the purpose of effective computation, there was a need to re-process the data into the following categories:
i. Normal: network traffic seen as non-harmful connections in the network.
ii. Denial of service (DOS): traffic that tries to prevent legitimate users from accessing system and network services, e.g. smurf, back, neptune, teardrop, pod and land.
iii. Probing: attacks that target the host to gather exploitable information, e.g. satan, ipsweep, portsweep and nmap.
iv. User to root (U2R): attacks that target super-user privileges on the local machine of users (victims), e.g. buffer_overflow, rootkit, loadmodule and perl.
v. Remote to local (R2L): attacks that use various techniques to gain access without having an account on the target host, e.g. guess_passwd, ftp_write, multihop, phf, spy, imap, warezclient and warezmaster.
As mentioned earlier, most research efforts use only a subset (10%) of the full data set for computational reasons, since the dataset is very large. This is the major focal point of this work. For computational scalability, our proposed model on the one hand analyzed the whole dataset without any preprocessing, and on the other hand carried out different stages of data preprocessing: missing value imputation, feature selection, discretization and sampling. Table 2.0 provides the re-coding of the target class (attack types) into the five classes using the same re-coding mechanism.
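The re-coding of attack labels into the five classes can be sketched as a simple lookup. This is a minimal Python illustration, not the authors' R code: the grouping uses only the example attack names listed above, the names ATTACK_CATEGORIES, LABEL_TO_CATEGORY and recode are hypothetical, and the handling of a trailing "." on raw labels is an assumption about the file format.

```python
# Grouping taken from the five categories described in the text.
# All identifiers here are illustrative assumptions, not the paper's code.
ATTACK_CATEGORIES = {
    "Normal": ["normal"],
    "DOS": ["smurf", "back", "neptune", "teardrop", "pod", "land"],
    "Probe": ["satan", "ipsweep", "portsweep", "nmap"],
    "U2R": ["buffer_overflow", "rootkit", "loadmodule", "perl"],
    "R2L": ["guess_passwd", "ftp_write", "multihop", "phf", "spy",
            "imap", "warezclient", "warezmaster"],
}

# Invert the grouping so each raw label points at its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}


def recode(label):
    """Map a raw connection label to one of the five classes.

    Assumes labels may carry surrounding whitespace or a trailing '.';
    raises KeyError for any label not listed in the text.
    """
    return LABEL_TO_CATEGORY[label.strip().rstrip(".").lower()]


if __name__ == "__main__":
    for raw in ["smurf.", "normal.", "warezmaster."]:
        print(raw, "->", recode(raw))
```

Raising an error on unlisted labels, rather than silently defaulting to one class, keeps any attack type missing from the table visible during preprocessing.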

Sampling
A further step in data preprocessing is the sampling of attack types. The attack types with fewer connections in the full data set (Probe, U2R and R2L) are considered. Over-sampling was carried out on these less frequent attack types to overcome any bias that attacks with many connections might introduce against them. Random replication was applied to each of these connections. The analysis of our replication is presented in Section 5.

PROPOSED MODEL: RANDOM FORESTS (RF) AND H2O
In this paper, we implemented an RF classifier on the H2O platform. The RF classifier is an ensemble machine learning method for classification, regression and other tasks [17]. The classifier works by adding an additional layer of randomness to bagging. In addition to constructing each tree from a different bootstrap sample of the data, it changes how the classification or regression trees are constructed. It has proved to be efficient and scalable in handling datasets with varying attribute types and sizes compared to other classification trees [8]. Algorithm 1.0 describes the procedure of RF.
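The two sources of randomness described above, bootstrap sampling and randomized tree construction, can be illustrated with a deliberately small Python sketch. It is not Algorithm 1.0 and not the H2O implementation: each "tree" here is a one-level stump on a single randomly chosen feature with its threshold at the feature mean, which is a simplifying assumption, and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """One-level tree: split one feature at its mean, vote on each side."""
    threshold = sum(row[feature] for row in rows) / len(rows)
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    fallback = majority(labels)  # used when one side of the split is empty
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: draw n with replacement
        feature = rng.randrange(n_features)         # random feature for this tree
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual tree predictions by majority vote."""
    return majority([predict_stump(stump, row) for stump in forest])


if __name__ == "__main__":
    X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.3)]
    y = ["normal"] * 3 + ["attack"] * 3
    forest = fit_forest(X, y)
    print(predict_forest(forest, (0.1, 0.1)), predict_forest(forest, (5.0, 5.0)))
```

A full random forest grows deep trees and draws a fresh random subset of features at every split; the stump version only keeps the bootstrap-then-vote structure visible.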

H2O Platform
Big data analysis, together with advances in high-performance computing, has opened up a wide range of developments in software and hardware tools, one of which is H2O. H2O is a fast, scalable, open-source machine learning tool for big data and smarter applications. It readily provides advanced and classical algorithms for solving diverse problems in machine learning [18]. Figure 2.0 describes the architectural framework of our proposed model, which integrates H2O technologies with the data.table package, both implemented in the R language [19]. The following cutting-edge features make H2O widely accepted and applicable to machine learning for big data.
i. Best of breed in open-source technology.
ii. Support for a web UI and easy interfacing.
iii. All-round support for common databases, file types and data through an integrated development environment (IDE), data compression and all data platforms.
iv. Massively scalable big data support in real time.
v. Real-time data scoring through the Nano fast scoring engine.
vi. Ready support for a wide range of machine learning algorithms built in a parallel and distributed manner, with easy interfacing for parameter tuning.
vii. Ever-growing support for native integration with widely accepted languages such as R, Java, Scala and Python, and a robust REST API for easy workflow among the tools.
viii. In-memory parallel processing, so that one can stop sampling the data and instead use the whole available data set for the analysis. This allows large data sets to be used in full for machine learning.

Datasets
The RF classifier from the H2O platform was implemented in R. The full KDD'99 dataset, totalling 4,898,431 records, was analyzed, re-processed and re-sampled prior to the training process. Re-sampling added 1,674,595 connections, bringing the training set to 6,573,026 records. The test dataset has 431,330 records. A detailed analysis of the results is provided in the following sections. Figure 3.0 shows the distribution of attack types prior to replication and grouping. Figure 4.0 then shows the distribution after replication and grouping.

Timing and Duration
Our coding approach provided a template for measuring the time taken to train on the dataset. Table 3.0 shows the details. The training was completed in 14,604.80 s (about 4 hours) on an HP ProBook 6460b with a Core i5-2520M at 2.50 GHz and 16 GB RAM, running 64-bit R. This should provide a basis for comparison with other architectures, big data tools and ML algorithms. Figure 5.0 illustrates the duration of each attack type prior to re-processing. Portsweep was observed to have the longest attack duration, followed by warezclient, while land and pod had the shortest.
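The timing template itself is not reproduced in this section. A minimal Python sketch of how such a measurement could be taken is given here; the names timed, split_duration and train_model are illustrative assumptions, and train_model is only a stand-in for a real training call.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) from a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def split_duration(seconds):
    """Break a duration into whole hours, whole minutes and leftover seconds."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(int(minutes), 60)
    return hours, minutes, secs


def train_model(n):
    """Hypothetical stand-in for model training; it only burns CPU time."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, elapsed = timed(train_model, 100_000)
    h, m, s = split_duration(elapsed)
    print(f"training took {h} h {m} min {s:.2f} s")
```

Reporting the raw seconds alongside the converted value, as Table 3.0 does, makes comparisons across machines less error-prone.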

Model Metrics
The RF model was trained with basic parameter settings of 500 trees and a depth of 100. The outputs generated by the classifier, together with the detailed metrics used and produced by the model, are shown in Tables 4.0, 5.0 and 6.0. Figure 6.0 shows the classification error rate in relation to the number of trees trained. The classification error rate decreased as the number of trees increased, meaning that a larger forest yields less error. Thus RF fulfills its mandate by building an ensemble of small trees into a forest to minimize error.

Feature Selection
The Random Forests (RF) classifier implemented in this work allows parameter tuning for efficient computation. It also supports feature selection. The built-in RF feature selection was used to calculate the variable importance over the training dataset. Figure 7.0 shows that V3: Service is the most important attribute, followed by V23: count, while V21: is_host_login and V6: dst_byte are the least important (see Table 7.0 for detailed attribute descriptions and coding).

Confusion Matrix
The confusion matrix is used to determine and summarize the performance of a classifier on test data, through measures such as accuracy, error rate, sensitivity (recall) and precision. A typical confusion matrix is provided in Table 8.0. The performance metrics that can be deduced from the matrix are as follows.
i. Accuracy: the proportion of all classifications in the dataset that were correct, expressed as a percentage.
The confusion matrix generated from the trained model is shown in Table 9.0, and the overall performance metrics of equations (2.0) to (7.0) are provided in Table 10.0.
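Since equations (2.0) to (7.0) are not reproduced in this section, the following Python sketch uses the standard textbook definitions of accuracy, error rate, precision and recall for a multi-class confusion matrix, which may differ in detail from the paper's formulas. It assumes rows are actual classes and columns are predicted classes; the example matrix is made up and is not the paper's Table 9.0.

```python
def accuracy(cm):
    """Fraction of all predictions that fall on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def error_rate(cm):
    """Fraction of all predictions that are wrong."""
    return 1.0 - accuracy(cm)


def precision_recall(cm, k):
    """Precision and recall for class k (rows = actual, columns = predicted)."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                 # actual k, predicted as something else
    fp = sum(row[k] for row in cm) - tp  # predicted k, actually something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up two-class matrix for illustration only.
    cm = [[50, 10],
          [5, 35]]
    print(f"accuracy = {accuracy(cm) * 100:.2f}%")
    print("class 0 precision, recall =", precision_recall(cm, 0))
```

With imbalanced classes such as U2R and R2L, per-class precision and recall are more informative than overall accuracy, which is dominated by the majority classes.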

CONCLUSION AND FUTURE WORK
In this paper, we have demonstrated that big data analytical tools can scale to large data sets irrespective of the number of rows and columns. They also provide leverage for implementing classical and modern machine learning algorithms on large data, thereby overcoming computational and complexity challenges. An accuracy of 99.91% was obtained from the RF implementation. Classification error was minimized with a larger number of trees. Feature selection was also computed efficiently. This is a considerably better result, and it shows that RF is effective as a predictive mechanism for upcoming IDS. Meanwhile, the model was implemented on a localhost cluster machine with minimal resources for big data analytics. Future work could investigate the model on a distributed-mode cluster with better resources in order to minimize training time. Other big data tools with integrated machine learning techniques could also be applied to IDS analysis. This would further help to confirm the viability of the result obtained in this work. Big data analytics has come to stay, as more and more data are being generated in and around us. As it is embraced, big data analytics promises to yield enormous benefits to life, information systems, businesses and national security at large.