Intrusion Detection System Based on Combination of Optimized Genetic and Firefly Algorithms in Cloud Computing Structure

Attackers are always looking for ways to attack networks. Optimizing and securing system settings goes a long way toward keeping them out. Intrusion detection systems (IDS), firewalls, and honeypots are technologies that can help prevent attacks on networks. An IDS analyzes all activity on the network and uses the information in its database to determine whether an activity is authorized or unauthorized. It also determines whether the activity can harm the network and reports such activity by sending alarms or alerts to the system administrator. The main purpose of an intrusion detection system is to classify data and network traffic. Intrusion detection is therefore essentially a classification operation, so improving the classification improves the performance of the intrusion detection system. For this reason, we use the Error-Correcting Output Codes (ECOC) algorithm to improve classification performance by decomposing the general problem into simpler sub-problems. Breaking the problem down into smaller classes and assigning a separate classifier to each one increases the power and accuracy of the classification, and thereby improves overall system performance. Another important factor in detection performance is the use of appropriate features for training and testing the classifiers. For this reason, we use firefly and genetic algorithms to select suitable features for each classifier at each level. The main goal of this research is to provide an intrusion detection system with better intrusion detection and performance. Based on the evaluation results, the proposed system increases the detection rate by up to 5% compared with other intrusion detection systems.


Introduction
Today, most vital infrastructures, such as telecommunication, transportation, business, and banking, are managed through computer networks, so securing these systems against planned attacks is very important. Most of these attacks exploit software errors and security gaps in the system. Since software errors cannot be eliminated completely, every piece of software contains security flaws, which are known as software vulnerabilities. Researchers have been trying to find these vulnerabilities in order to identify the gaps through which a system can be penetrated and then to protect the system through preventive or counteractive strategies.
Every dictionary offers its own definition of intrusion, and the meaning and scope of the term are also widely discussed in computer science. Many consider an intrusion to be an unsuccessful attack, while others use different definitions. As a result, intrusion can be defined as "an active set of related events with the aim of unauthorized access to information, information conversion and system detriment in order to make the whole system unusable." This definition includes both successful and unsuccessful efforts. An intrusion detection system can be considered a set of tools, methods, and documentation that identify and report unauthorized or unregistered activities on the network. The name "intrusion detection" is not entirely appropriate for such systems, since they may flag a specific action as an intrusion when it is not actually one. Furthermore, intrusion detection systems are not self-sufficient or independent, because they form only a small part of the overall computer protection system.

Research and related work
In related research, [21] presents a new fuzzy method based on semi-supervised learning that improves classification performance by using unlabeled examples. In [20], a hybrid intrusion detection system is proposed to identify internal and external attacks. In this system, signature recognition algorithms are used to identify internal attacks and fuzzy firefly algorithms are used to detect external attacks. In [18], a feature extraction algorithm is presented that improves classification performance in intrusion detection systems. This algorithm supports both linear and nonlinear data. Furthermore, the system applies a hybrid algorithm that uses particle swarm optimization (PSO) for weight generation and for combining classifiers. In [19], a new algorithm uses PSO for parameter and feature selection, and an SVM is then used as the classifier.
In [3], an intrusion detection system is proposed that uses the decision tree algorithm and a back-propagation neural network and achieves acceptable detection accuracy. In [4], a composite system based on the back-propagation neural network and the decision tree algorithm is designed using the KDD CUP 99 dataset. The results show that the intrusion detection system is not able to detect all types of network attacks when the neural network is used without a decision tree.
In [5], a multi-layer neural network is used in an off-line system design to classify normal network activities and the DOS and Probe attacks, and a multi-stage system is proposed for classifying normal data and the related attacks. The results show that this system performs better than a one-stage system. In [6], the authors use the k-means clustering algorithm to divide the dataset into several sub-spaces and then use a set of MLPs for each sub-space. Their model shows acceptable performance on the KDD CUP 99 dataset. In [18], researchers introduce a classification model that includes MLP and RBF, whose results show passable performance on network data.
The goal of most researchers is to distinguish different types of attacks from normal data. In most studies, the decision tree has been used along with other classification algorithms to improve system performance. Accordingly, composite IDSs that combine the capabilities of various classification algorithms have achieved excellent results. What matters is the application of simple and efficient methods for designing intrusion detection systems that can be implemented in real networks. Combined techniques, despite their acceptable performance, may not be operationally efficient on different types of networks because of their low detection speed. If a combined method can maintain its performance without losing speed, an optimal intrusion detection system can be implemented.

Proposed algorithm
Our proposed algorithm is presented in this section. The algorithm consists of three main steps. In the first step, the ECOC algorithm is applied to improve the performance of the data classification. In the second step, we use genetic and firefly algorithms to select the most appropriate features and thereby improve the performance of each classifier. Finally, in the third step, we determine the class of each data sample using the Hamming distance.

In the first step
Using the ECOC technique, we decompose the classification problem into 15 binary problems with a paired-class separation strategy. The data contain five different classes: a normal data class and four attack classes, which are DOS, Probe, R2L, and U2R. The pattern of paired-class separation is given by the ECOC matrix in Table 1. The purpose of using the ECOC algorithm is to divide and simplify the system so that all data types can be classified in different combinations, from which various information and analyses can be extracted.
Another distinctive feature of this algorithm is that it classifies data in the simplest possible way. In the proposed classification method, each binary problem involves only two groups of data classes. The ECOC pattern is a mask or template expressed as a matrix, which in most cases can be random.
Each column of this matrix defines one of the binary problems and covers all data types. Within a column, the data classes that carry the same label are treated as one group in order to maintain the paired-class strategy. As a result, the pattern matrix contains only two values, 0 and 1.
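One way to obtain exactly 15 two-group splits of five classes is an exhaustive ECOC code, which has 2^(5-1) - 1 = 15 columns. The following Python sketch builds such a matrix. It is an illustration under that assumption and not necessarily the matrix of Table 1, which the text says may also be random. The function name `exhaustive_ecoc` is hypothetical.

```python
from itertools import product

CLASSES = ["Normal", "DOS", "Probe", "R2L", "U2R"]

def exhaustive_ecoc(n_classes):
    """Return a 0/1 code matrix with one row per class and one column
    per distinct two-group split of the classes."""
    columns = []
    # Fix the first class to 1 so a split and its complement are counted once.
    for rest in product([0, 1], repeat=n_classes - 1):
        column = (1,) + rest
        if all(bit == 1 for bit in column):
            continue  # all classes in one group: nothing to separate
        columns.append(column)
    # Transpose: rows become class codewords, columns become binary classifiers.
    return [list(row) for row in zip(*columns)]

matrix = exhaustive_ecoc(len(CLASSES))
for name, codeword in zip(CLASSES, matrix):
    print(f"{name:<6} {''.join(map(str, codeword))}")
```

Each column of the result corresponds to one binary classifier: classes marked 1 form one group and classes marked 0 form the other.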
For example, in each of the 15 binary problems, all data classes with the value 1 are placed in one group and all data classes with the value 0 are placed in the other group.

In the second step
After the paired classes are identified, feature extraction and selection are performed for each of them by the genetic and firefly algorithms. In this step, the classifiers, which correspond to the matrix columns, are responsible for selecting features. Each classifier has the task of classifying one paired class. Although each data record has 41 features, we do not need all of them, since the classification task has been simplified. For this reason, we use the genetic and firefly algorithms to derive the attributes each classifier needs. Feature extraction is carried out separately for each classifier by the genetic and firefly algorithms, and the features that are finally approved are taken as the required characteristics of that classifier.
The joint features selected by the two algorithms are taken as the main attributes of each class. In each classifier, the genetic and firefly algorithms search for a pattern or mask over the attributes, as presented in Table 2. In this mask, attributes marked with the value 1 are selected and attributes marked with the value 0 are not selected. Finally, we train each classifier using a decision tree and the selected attributes, and use the accuracy of each decision tree as the fitness function.
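The mask idea can be sketched in a few lines of Python. This is a minimal illustration that assumes "joint features" means the features selected by both algorithms (the intersection of the two masks). The masks themselves are made-up placeholders, and `joint_mask` and `apply_mask` are hypothetical helper names.

```python
N_FEATURES = 41  # each record has 41 features

def joint_mask(ga_mask, firefly_mask):
    """Keep a feature only if both algorithms selected it
    (assumed reading of 'joint features')."""
    if len(ga_mask) != len(firefly_mask):
        raise ValueError("masks must have the same length")
    return [g & f for g, f in zip(ga_mask, firefly_mask)]

def apply_mask(record, mask):
    """Return only the feature values whose mask bit is 1."""
    return [value for value, bit in zip(record, mask) if bit == 1]

# Placeholder masks for one classifier (not results from the paper).
ga_mask = [1 if i % 2 == 0 else 0 for i in range(N_FEATURES)]
firefly_mask = [1 if i % 3 == 0 else 0 for i in range(N_FEATURES)]
mask = joint_mask(ga_mask, firefly_mask)

# Stand-in record whose values equal the feature index.
record = list(range(N_FEATURES))
print(apply_mask(record, mask))
```

In a full pipeline, the reduced records would be passed to the decision tree of the corresponding classifier, and the tree's accuracy would serve as the fitness value of the mask.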

Finally, in the third step
In this step, the class of each input data sample must be determined. The classifier outputs for an input sample have the form shown in Table 3: a row with 15 columns, one for each classifier. To determine the class of a sample, we compare its output row with the five class rows of the code table by calculating the Hamming distance. We then assign the sample to the class with the lowest Hamming distance, that is, the highest similarity.
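The decoding step can be sketched as follows in Python. The codewords below are illustrative 15-bit rows from an exhaustive five-class code, not the paper's Table 1, and the names `hamming` and `decode` are hypothetical.

```python
# Illustrative codewords: one 15-bit row per class (assumed, not from Table 1).
CODEWORDS = {
    "Normal": "111111111111111",
    "DOS":    "000000001111111",
    "Probe":  "000011110000111",
    "R2L":    "001100110011001",
    "U2R":    "010101010101010",
}

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def decode(outputs, codewords=CODEWORDS):
    """Return the class whose codeword is closest to the classifier outputs.
    Ties go to the class listed first."""
    return min(codewords, key=lambda name: hamming(outputs, codewords[name]))

# Outputs of the 15 classifiers for one sample: the Probe codeword
# with its first bit flipped by a misclassifying classifier.
sample_outputs = "100011110000111"
print(decode(sample_outputs))
```

Because the class codewords differ from each other in several positions, a sample can still be assigned to the correct class when a few of the 15 classifiers are wrong.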

Evaluation
To evaluate the proposed method, a network traffic dataset is used. This dataset was collected by MIT Lincoln Laboratory. Each record of the main KDD CUP 99 dataset consists of 41 features and a class label. The dataset includes five different classes, as shown in Table 5. As mentioned, the KDD CUP 99 dataset is used to generate the training and test datasets. This dataset has more than 4 million records, which is too many for the simulation process. Hence, following [23], the training dataset contains 60593 normal records, 49115 DOS records, 1917 Probe records, 899 R2L records, and 26 U2R records, which are randomly selected from the main dataset. Each record provides 41 features as input and one of five classes as output: the normal class and the DOS, Probe, R2L, and U2R attack classes.
There are various criteria for evaluating the performance of intrusion detection systems. The proposed evaluation process uses recall, precision, and the F-measure (FM). Precision = the number of relevant retrieved documents / the total number of retrieved documents. Recall = the number of relevant retrieved documents / the total number of relevant documents in the database. The results reported below highlight the advantages of using genetic and firefly algorithms. An important point is that the intrusion detection system has the task of categorizing and classifying different data with different characteristics. Therefore, reporting system performance for each group of data expresses the overall performance more explicitly. As shown in Table 6, performance differs from one data class to another, since the features required for the classification task are different and, more importantly, the proportions of the data classes in internet traffic differ. This is why intrusion detection systems detect normal data better than the attack classes. On closer inspection, a class with a larger share of the data is detected better, because more examples of it are available for training.
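The per-class measures can be computed as in the following Python sketch. It assumes the F-measure is the usual harmonic mean of precision and recall, which the text does not state explicitly. The labels in the example are toy values, not results from the paper, and `class_scores` is a hypothetical name.

```python
def class_scores(y_true, y_pred, label):
    """Precision, recall, and F-measure for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly detected
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly assigned to label
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # Assumed definition: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy labels for illustration only.
y_true = ["Normal", "Normal", "DOS", "DOS", "U2R"]
y_pred = ["Normal", "DOS", "DOS", "DOS", "Normal"]
for label in ["Normal", "DOS", "U2R"]:
    p, r, f = class_scores(y_true, y_pred, label)
    print(f"{label:<6} precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In the toy data the single U2R record is missed, so all three U2R scores are zero, which illustrates how a class with very few records can have much lower scores than the others.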
System performance is particularly important in detecting U2R attacks, because the amount of U2R-related data in an internet traffic dataset is limited. Training and testing on these data is therefore more difficult than on the other data groups. Ultimately, however, the overall performance of the intrusion detection system is what is reported. In Table 7, we compare the performance of the proposed system with a number of related systems. As the table shows, different algorithms were used to implement them. By examining other similar systems, we conclude that hybrid systems give better results: each algorithm has its own strengths and weaknesses, so a system that uses the strengths of each algorithm can achieve acceptable performance.

Conclusion
In this paper, using a large amount of data in a cloud computing environment, we proposed a method based on a combination of firefly and genetic algorithms to detect intrusions in a cloud computing structure with acceptable accuracy.
The firefly algorithm is used to select the initial population that enters the genetic algorithm. In this way, it can improve the genetic algorithm by supplying its early, randomized chromosome population. Cloud computing is a large and complex environment that includes hardware, software, and security. The success or failure of cloud services depends on users' trust that their data and processes are protected in a safe and secure environment. The most critical part of this research is therefore ensuring a secure environment by providing a basic view of hardware and software security policies. In the future, the widespread use of cloud computing will encourage attackers to attempt intrusions because of the large amount of data and resources involved. Although additional attack recognition techniques would minimize the related losses, for future work we strongly recommend using open standards to prevent conflicts and lock-in problems, and establishing specific security standards for the cloud computing structure.