Statistical Analysis of the TSAD Interactome in Multiple Sclerosis: Multiple Testing, High Dimensional Regression and Interactions

Multiple Sclerosis (MS) is a chronic inflammatory disease of the Central Nervous System (CNS). The presence of Oligoclonal Bands (OCB) in Cerebrospinal Fluid (CSF) is an important diagnostic marker in MS. The main aim of this study was to determine SNPs and SNP-SNP interactions in the genomic TSAD (T-Cell Specific Adaptor Protein) region which explain the difference between two MS conditions, OCB positive vs. OCB negative, in sampled Norwegian patients. The data for this study were obtained from the MS Registry, Ulleval University Hospital, Oslo, Norway. Of 899 patients, 802 were OCB positive and 97 OCB negative; each patient was genotyped for 923 SNPs at specific chromosomal positions. The study applied two different statistical approaches. First, we applied variable selection based on the Lasso: we discuss the Lasso for logistic regression and examine interaction effects among the Lasso-selected SNPs. Second, we performed variable selection based on tests of association, using the Chi-Square test and Fisher’s exact test to assess the association between OCB status and each SNP. The Chi-Square test selected 34 significant SNPs and Fisher’s exact test selected 38 significant SNPs at a significance level of 0.05. We then applied the Bonferroni and False Discovery Rate corrections for multiple testing to the statistically significant SNPs. Finally, we looked for interaction effects among the SNPs selected by the tests of association, and we determined SNPs and SNP-SNP interactions which appear to be significantly associated with the OCB subpopulation of MS patients, based on our study of the Norwegian cohort. These SNPs were selected by at least one of the methods that we used (hypothesis tests and penalized regression). They are the SNPs that we think should be studied further and validated on a new dataset.
Of particular importance are the seven SNP-SNP interactions which we found. It is the first time a SNP-SNP study has been performed on these data, and the finding will be communicated to the Norwegian molecular biologists to be followed up.


Research Questions
(1) Can some of the SNPs in the genomic TSAD region that are associated with the OCB condition be used to classify future patients by OCB status in MS?
(2) Do SNPs in the genomic TSAD region have a significant association with the OCB condition?
(3) Can we find interactions of SNPs in the genomic TSAD region which are associated with the OCB condition, and thereby describe first elements of the biological mechanisms of the OCB condition in MS?

Objectives of the Study
The general objective of this study is to identify SNPs and SNP-SNP interactions in the genomic TSAD region which explain the difference between two MS conditions: OCB positive vs. OCB negative. The specific objectives of the study are:
(1) To fit a logistic regression model with a Lasso penalty in order to determine SNPs in the genomic TSAD region which are associated with OCB status, so that they allow a classification of future patients by OCB condition in MS.
(2) To investigate which genetic factors of the genomic TSAD region are important for the OCB condition in Multiple Sclerosis, using different statistical hypothesis tests (Fisher's exact test and the Chi-Square test of association).
(3) To assess which SNPs in the TSAD region have a significant association with the OCB condition after multiple testing correction by the Bonferroni and False Discovery Rate methods.
(4) To identify SNP-SNP interactions in the genomic TSAD region which are associated with the OCB condition, and thereby describe first elements of the biological mechanisms of the OCB condition in MS.

Methodology

Description of the Data
The data for this study were secondary data collected from two Norwegian sources: the Oslo MS DNA Bio Bank and the Norwegian MS Registry and Bio Bank.

Description of Variables Considered under the Study
In this study, we categorized a total of 899 MS patients into two groups by OCB status. Of these, 802 persons are OCB positive and 97 persons are OCB negative. OCB are proteins found in the spinal fluid of MS patients. They are called oligoclonal bands because they appear in the part of the electrophoresis of the spinal fluid where the gamma globulins are found. They are thought to be produced by single clones of B-cells located within the CNS of MS patients. The origin of the OCB is not understood, but they are viewed as quite specific for MS, although not all MS patients have these bands. It is important to try to find a genetic signature that distinguishes OCB positive from OCB negative MS patients. This will allow a better understanding of the disease and may also make it possible to predict the long-term prognosis. For each person, we have used 923 SNPs (Single Nucleotide Polymorphisms) at 923 specific positions of their DNA. These 923 SNPs have been chosen because they are located in, or around, genes that are related to an important protein, called T cell specific adapter protein (TSAd), which is believed to have a very important role in the development of MS. This means that these SNPs have a good chance of being relevant to the distinction between OCB positive and OCB negative patients, as they reside on genes which biologists believe are important in MS. T cells play a crucial role in our defense against infection and cancer, but these cells can also be dangerous, as they can start to react against the body they are supposed to defend, leading to autoimmune diseases such as MS. This probably happens because these cells are triggered in some way; we do not know exactly how and why. Over the last 10 years, many research groups worldwide have therefore focused primarily on elucidating molecular mechanisms for the control of T cell activation, and in particular the role of TSAd in this process.
One such group is led by Professor Anne Spurkland, who selected these 923 SNPs for us. The importance of TSAd is described in many papers; we cite here those of Professor Spurkland's group. They have found that TSAd modulates early signaling events in T cells (Kolltveit et al, 2008) [18]. The group has also found that a certain gene participating in the production of TSAd is associated with increased susceptibility to MS and juvenile arthritis (Lorentzen et al, 2008) [19]. Genetically determined variation in the expression level of TSAd may provide a mechanism for how TSAd contributes to genetic susceptibility to autoimmune disease. In the patient sample, the reported age at onset is 33.7 years and the age at diagnosis is 48.5 years, and 13% of the patients have a primary progressive clinical course.

The Response Variable of the Study
The outcome of interest in this study is the OCB status, categorized as OCB positive or OCB negative; that is, the outcome is a categorical response with two categories. Hence, the response variable for the ith person is represented by the random variable $Y_i$ with two possible values, coded as 1 for OCB positive ("success") and 0 otherwise. Mathematically, we write this as:
$$Y_i = \begin{cases} 1, & \text{if the } i\text{th patient is OCB positive} \\ 0, & \text{if the } i\text{th patient is OCB negative} \end{cases}$$

Independent Variables of the Study
Each person in this study has 923 SNPs (Single Nucleotide Polymorphisms) measured at specific positions on their chromosomes. The layout of these 923 SNPs (explanatory variables) and their coding is given below. A missing value is coded as "NA". Each SNP has three possible genotypes, for example AA, AT (or TA) and TT for a SNP with alleles A and T. For the analysis, these were coded as 0 for AA, 1 for AT or TA, and 2 for TT. Table 1 summarizes the coding of these predictor variables.
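The coding rule above can be illustrated with a minimal Python sketch. The function name `encode_genotype`, the dictionary `GENOTYPE_CODES` and the sample genotypes are hypothetical and are not taken from the study data; missing values are mapped to `None`.

```python
from typing import Optional

# Assumed mapping for a SNP with alleles A and T, following the coding in the text.
GENOTYPE_CODES = {"AA": 0, "AT": 1, "TA": 1, "TT": 2}

def encode_genotype(genotype: str) -> Optional[int]:
    """Return 0, 1 or 2 for a genotype string, or None for a missing value ("NA")."""
    if genotype == "NA":
        return None
    return GENOTYPE_CODES[genotype]

# Hypothetical genotypes of five patients at one SNP.
sample = ["AA", "TA", "NA", "TT", "AT"]
encoded = [encode_genotype(g) for g in sample]
print(encoded)  # [0, 1, None, 2, 1]
```

How missing values are handled afterwards (removal or imputation) is not specified in the text and is left open here.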

Method of Data Analysis

Fitting the Logistic Regression Model to Data
As in multiple regression analysis, there are two important stages in the analysis of data. First, estimates of the parameters in the model must be obtained. Second, some determination must be made of how well the model actually fits the observed data. In multiple regression analysis, the parameter estimates are obtained using the least-squares principle, and assessment of fit is based on significance tests for the regression coefficients as well as on interpreting the multiple correlation coefficient. In logistic regression analysis (LRA), the parameters that must be estimated from the available data are the constant, $\beta_0$, and the logistic regression coefficients, $\beta_j$. Because of the nature of the model, estimation is based on the maximum likelihood principle rather than on the least-squares principle. Maximum likelihood estimation (MLE) is the standard method of estimating the unknown parameters in a logistic regression model. This method yields values for the unknown parameters which maximize the probability of obtaining the observed response values. The likelihood function expresses the probability of the observed response values as a function of the unknown parameters. In the context of logistic regression analysis, MLE involves the following. First, we define the likelihood, L(parameter|data), of the sample data as the product, across all sampled cases, of the probabilities for success or for failure:
$$L(\beta_0, \beta_1, \ldots, \beta_p \mid \text{data}) = \prod_{i=1}^{n} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}, \qquad \pi_i = P(Y_i = 1 \mid X_{i1}, \ldots, X_{ip}) = \frac{\exp(\beta_0 + \sum_{j=1}^{p} \beta_j X_{ij})}{1 + \exp(\beta_0 + \sum_{j=1}^{p} \beta_j X_{ij})}$$
Note that $Y_i$ is the 0/1 outcome for the ith case and $X_{i1}, \ldots, X_{ip}$ are the values of the predictor variables for the ith case, based on a sample of n observations. The use of $Y_i$ and $1 - Y_i$ as exponents in the equation above includes in the likelihood the appropriate probability term, depending on whether $Y_i = 1$ or $Y_i = 0$. Using the methods of calculus, a set of values for $\beta_0$ and the $\beta_j$ can be calculated that maximize L(parameter|data); these resulting values are known as maximum likelihood estimates (MLEs).
This maximization process is somewhat more complicated than the corresponding minimization procedure in multiple regression analysis for finding least-squares estimates. However, the general approach involves establishing initial guesses for the unknown parameters and then iteratively adjusting these estimates until the maximum value of L(parameter|data) is found. This iterative solution procedure is available in statistical packages. The usefulness of the model as a whole can be assessed by testing the hypothesis that, simultaneously, all of the partial logistic regression coefficients are 0, i.e., $H_0: \beta_j = 0$ for all $j = 1, \ldots, p$. In effect, we compare the general model $\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij}$ with the restricted model $\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0$. This test, which is analogous to testing the significance of the multiple R in multiple regression analysis, is based on a chi-squared statistic (R software calculates the value of "Chi-Square"). Finally, different logistic regression models fitted to the same set of data can be compared statistically in a simple manner if the models are hierarchical. The hierarchy principle requires that the model with the larger number of predictors include among its predictors all of the predictors from the simpler model. Given this condition, the difference in model chi-squared values is (approximately) distributed as chi-squared with degrees of freedom equal to the difference in degrees of freedom for the models. In effect, this procedure tests a conditional null hypothesis. If the models are specified, R software calculates the chi-squared value for each model, and this can be used to test whether or not the additional predictors result in a significantly better fit of the model to the data.
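To make the iterative maximization and the chi-squared comparison of nested models concrete, the following is a minimal sketch for a single predictor, written in Python rather than in R. It uses Newton-Raphson updates (one common choice of iterative scheme; the text does not say which scheme was used) and compares the fitted model with the intercept-only model through the likelihood ratio statistic. The function names and the toy data are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta0, beta1, x, y):
    """Bernoulli log-likelihood: sum of y*log(pi) + (1 - y)*log(1 - pi)."""
    total = 0.0
    for xi, yi in zip(x, y):
        pi = sigmoid(beta0 + beta1 * xi)
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson iterations for a logistic model with one predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            pi = sigmoid(b0 + b1 * xi)
            w = pi * (1 - pi)
            g0 += yi - pi            # score for the intercept
            g1 += (yi - pi) * xi     # score for the slope
            h00 += w                 # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: genotype codes 0/1/2 and a 0/1 outcome.
x = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

b0, b1 = fit_logistic(x, y)
y_bar = sum(y) / len(y)
ll_null = log_likelihood(math.log(y_bar / (1 - y_bar)), 0.0, x, y)  # intercept-only model
ll_full = log_likelihood(b0, b1, x, y)
lr_statistic = 2.0 * (ll_full - ll_null)
# Upper tail of a chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr_statistic / 2.0))
print(b0, b1, lr_statistic, p_value)
```

The degrees of freedom equal the number of extra coefficients in the larger model, which is 1 here; the closed-form tail probability used above holds only for that case.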

Variable Selection for Logistic Regression: Lasso (Least absolute shrinkage and selection operator)
A statistical model is a simplification of reality (Agresti, 2007) [20]. At the initial stage of modeling, a large number of candidate predictors are considered to minimize possible modeling biases (Fan and Li, 2006) [21]. However, in most cases, not all the predictors have significant effects on the response variable. In statistics, a result from a hypothesis test is called statistically significant if it is unlikely to have occurred by chance. A simpler model that contains only the important predictors is preferred because it is easier to explain. Parsimony is especially important for high-dimensional data. Parsimony means that the simplest plausible model with the fewest possible predictors is desired. Variable selection plays an important role in regression analysis and is intended to select the best subset of predictors. There are typically two competing goals in statistical modeling: the model should be complex enough to fit the data well, and it should also be simple to interpret (Agresti, 2007) [20]. In linear regression, parameter estimation by the ordinary least squares (OLS) method is unbiased. However, the estimates may have large variance in some cases, for instance when multicollinearity occurs. At the cost of a slight bias, ridge regression tends to improve the prediction accuracy by shrinking some coefficients. But ridge regression does not shrink any coefficient to exactly 0, and the fitted model might be too complex to interpret. In 1996, Tibshirani introduced a different shrinkage method, called the Lasso (least absolute shrinkage and selection operator). This method shrinks some coefficients to 0 by a constraint on the sum of absolute values of the regression coefficients, so the Lasso can serve as a tool for variable selection. The Lasso is a shrinkage method like ridge regression, with subtle but important differences.
Like ridge regression, penalizing the absolute values of the coefficients introduces shrinkage towards zero. However, unlike ridge regression, some of the coefficients are shrunken all the way to zero; such solutions, with multiple values that are identically zero, are said to be sparse. The penalty thereby performs a sort of continuous variable selection. The resulting estimator was thus named the Lasso, for "Least Absolute Shrinkage and Selection Operator", and is defined by:
$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,$$
where $t \ge 0$ is a tuning parameter which controls the amount of shrinkage that is applied to the estimates and, for all $t$, $\hat{\beta}_0 = \bar{y}$. Just as in ridge regression, we can re-parameterize the constant $\beta_0$ by standardizing the predictors; the solution for $\hat{\beta}_0$ is $\bar{y}$, and thereafter we fit a model without an intercept by assuming $\bar{y} = 0$ and omitting $\beta_0$ without loss of generality. Computing the Lasso solution is a quadratic programming problem, although efficient algorithms are available for computing the entire path of solutions as the tuning parameter is varied, with the same computational cost as for ridge regression. Because of the nature of the constraint, making $t$ sufficiently small will cause some of the coefficients to be exactly zero. Thus the Lasso does a kind of continuous subset selection. If $t$ is chosen larger than $t_0 = \sum_{j=1}^{p} |\hat{\beta}_j^{ls}|$ (where $\hat{\beta}_j^{ls}$ are the least squares estimates), then the Lasso estimates are the $\hat{\beta}_j^{ls}$. (Hui et al, 2007) [22] develop versions of AIC and BIC for the Lasso that can be used to find an "optimal" value of the penalty parameter or, equivalently, of $t$. They suggested using BIC to find the "optimal" Lasso model when sparsity of the model is of primary concern. Lars, least angle regression (Efron et al, 2004) [23], provides a clever and hence very efficient way of computing the complete Lasso sequence of solutions as the constraint bound is varied from 0 to infinity. In fact, (Hui et al, 2007) [22] show that it is possible to find the optimal Lasso fit with computational effort equivalent to obtaining a single least squares fit.
Thus, the Lasso has the potential to revolutionize variable selection. It employs an L1-type penalty on the regression coefficients which tends to produce sparse models, and thus is often used as a variable selection tool, as in (Tibshirani, 1997; Osborne et al, 2000) [24]. (Knight and Fu, 2000) [25] studied the asymptotic properties of Lasso-type estimators. They showed that under appropriate conditions, the Lasso estimators are consistent for estimating the regression coefficients, and the limit distribution of the Lasso estimators can have positive probability mass at 0 when the true value of the parameter is 0. It has been demonstrated in (Tibshirani, 1996) [26] that the Lasso is more stable and accurate than traditional variable selection methods such as best subset selection. For multiple regression analysis, the Lasso estimate in penalized form is given as:
$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big\},$$
where $\lambda \ge 0$ is the Lasso penalty parameter. For logistic regression, the Lasso modifies the traditional parameter estimation method, maximum log-likelihood, by adding the L1 norm of the parameters to the negative log-likelihood function, so it turns a maximization problem into a minimization one. To solve this problem, we first need to give the value of the multiplier of the L1 norm, called the tuning parameter. Since the tuning parameter affects the coefficient estimation and variable selection, we want to find the optimal value of the tuning parameter to get the most accurate coefficient estimates and the best subset of predictors in the L1-regularized regression model. There are two popular methods to select the optimal value of the tuning parameter that results in a best subset of predictors: the Bayesian information criterion (BIC) and cross validation (CV). Therefore, best subsets of predictors are selected after standardizing the predictor variables, by applying BIC or k-fold cross validation (CV) (Tibshirani, 1996) [26]. The glmnet package then gives the optimum tuning parameter $\lambda$ for CV.
In the case of logistic regression with the Lasso, the estimate can be expressed as:
$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \Big\{ -\sum_{i=1}^{n} \big[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \big] + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}, \qquad \pi_i = \frac{\exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})}{1 + \exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})},$$
where $\lambda \ge 0$ is the tuning parameter that determines the amount of shrinkage. As with the Lasso for linear regression, we typically do not penalize the intercept term, and we standardize the predictors for the penalty to be meaningful. (Efron et al, 2004) [23] proposed Least Angle Regression (Lars) and showed that there is a close connection between Lars, the Lasso, and another model selection procedure called Forward Stagewise regression. Each of these procedures involves a tuning parameter that is chosen to minimize the prediction error.
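As an illustration of how an L1 penalty sets coefficients exactly to zero in logistic regression, the following minimal sketch minimizes the average negative log-likelihood plus lambda times the sum of absolute coefficients by proximal gradient descent with soft-thresholding. This is not the coordinate descent algorithm used by glmnet; the function names, the toy data, the step size and the 1/n scaling of the log-likelihood are assumptions made for the example.

```python
import math

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_logistic(X, y, lam, step=0.1, iterations=2000):
    """Proximal gradient descent for L1-penalized logistic regression."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(iterations):
        grad0, grad = 0.0, [0.0] * p
        for row, yi in zip(X, y):
            z = beta0 + sum(b * v for b, v in zip(beta, row))
            resid = 1.0 / (1.0 + math.exp(-z)) - yi  # fitted probability minus outcome
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * row[j] / n
        beta0 -= step * grad0  # the intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta0, beta

# Hypothetical toy data: two SNPs coded 0/1/2 and a 0/1 outcome.
X = [[0, 1], [0, 0], [0, 2], [0, 1], [1, 0], [1, 1],
     [1, 2], [1, 0], [2, 2], [2, 1], [2, 1], [2, 1]]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

for lam in (10.0, 0.05, 0.01):
    print(lam, lasso_logistic(X, y, lam))
```

With a large penalty every coefficient stays at exactly zero, and coefficients enter the model as the penalty decreases, which is the behavior that makes the Lasso usable for variable selection.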

Selection of Tuning Parameter: Cross Validation
Cross validation is a popular method for estimating the prediction error and comparing different models. Typically, the dataset is partitioned into two parts: the training data and the testing data. In k-fold cross validation, the dataset is randomly split into k mutually exclusive subsets of approximately equal size. Among the k subsets, one subset is retained as validation data for testing the model, and the remaining k-1 subsets are used as training data to fit the model. The cross validation process is repeated k times, so that each of the subsets is used exactly once as validation data. Different values of the tuning parameter can result in different fitted models from the same training data. The optimal model is the one that has the minimum cross-validated error, and the corresponding value of the tuning parameter is preferred (Jerome et al, 2008) [27].
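The splitting step of k-fold cross validation can be sketched as follows. This is a generic illustration, not the fold assignment used internally by glmnet; the function names and the seed are assumptions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k folds of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_validation_splits(n, k, seed=0):
    """Yield (training, validation) index lists; each fold is the validation set once."""
    folds = k_fold_indices(n, k, seed)
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# 899 patients and 10 folds, as in the ten-fold setting mentioned in the text.
for training, validation in train_validation_splits(899, 10):
    print(len(training), len(validation))
```

Because the folds are drawn at random, repeating the procedure with a different seed gives different folds and possibly a different selected tuning parameter.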

Lasso for Logistic Regression to See Interaction Effect
Testing for interactions after identifying main effects or marginal predictors is the next step. This strategy is prompted by the number of possible interactions. With p predictors, there are $\binom{p}{2} = \frac{p(p-1)}{2}$ two-way interactions, and with hundreds of thousands of SNPs it is impossible even to examine all two-way interactions (Tong, 2009) [28]. To evaluate the performance of Lasso penalized regression in association testing, we focus on underdetermined problems where the number of predictors p far exceeds the number of observations n. For the two-way case, the model is:
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{j<k} \gamma_{jk}\, x_{ij} x_{ik},$$
which involves both marginal effects ($\beta_j$) and two-way interactions ($\gamma_{jk}$).
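The growth in the number of two-way interactions, and the construction of the product terms, can be sketched as follows. The helper `add_two_way_interactions` is a hypothetical name; coding an interaction as the product of the two genotype codes is an assumption consistent with a model containing pairwise product terms.

```python
from itertools import combinations
from math import comb

def add_two_way_interactions(row):
    """Append the product x_j * x_k for every pair j < k to a row of main effects."""
    return list(row) + [row[j] * row[k] for j, k in combinations(range(len(row)), 2)]

print(comb(4, 2))    # 6
print(comb(38, 2))   # 703
print(comb(923, 2))  # 425503

# One hypothetical patient with four SNP codes: 4 main effects + 6 products = 10 variables.
print(add_two_way_interactions([0, 1, 2, 1]))
```

With all 923 SNPs there would be 425503 pairwise products, which is why interactions are only examined among SNPs that were selected first.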

Statistical Hypotheses Testing
Hypothesis testing is concerned with using observed data to make decisions regarding properties of (i.e., hypotheses for) the unknown data generating distribution. In any testing problem, two types of errors can be committed. A Type I error, or false positive, is committed by rejecting a true null hypothesis. A Type II error, or false negative, is committed by failing to reject a false null hypothesis. Ideally, one would like to simultaneously minimize both the number of Type I errors and the number of Type II errors. Unfortunately, this is not feasible and one seeks a trade-off between the two types of errors. This trade-off typically involves the minimization of Type II errors, i.e., the maximization of power, subject to a Type I error constraint. As in the case of single hypothesis testing, one can report the results of a multiple testing procedure in terms of the following quantities: rejection regions for the test statistics, confidence regions for the parameters of interest, and adjusted p-values. Adjusted p-values, for the test of multiple hypotheses, are defined as straightforward extensions of unadjusted p-values, for the test of individual hypotheses: the adjusted p-value for a particular null hypothesis is the smallest nominal Type I error level (for the multiple test of all hypotheses) at which one would reject this null hypothesis. The smaller the adjusted p-value, the stronger the evidence against the corresponding null hypothesis (Sandrine et al, 2008) [29].

Chi-Square Test of Independence
The Chi-Square test may be used both as a test of goodness-of-fit (comparing frequencies of one categorical variable to theoretical expectations) and as a test of independence. The underlying arithmetic is the same for both tests; the only difference is the way the expected values are calculated. However, goodness-of-fit tests and tests of independence are used for quite different experimental designs and test different null hypotheses. The Chi-Squared test of independence is used when we have two categorical variables, each with two or more possible values.
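For a 2 x 3 table of OCB status by genotype, the test of independence can be sketched as below. Expected counts are computed from the row and column totals, and the p-value uses the fact that the upper tail of a chi-squared distribution with 2 degrees of freedom is exp(-x/2). The counts are hypothetical (only the row totals 802 and 97 match the study), and the function names are assumptions.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df2(stat):
    """Upper tail of chi-squared with (2 - 1) * (3 - 1) = 2 degrees of freedom."""
    return math.exp(-stat / 2.0)

# Hypothetical counts: rows are OCB positive / OCB negative, columns are genotypes 0 / 1 / 2.
table = [[400, 320, 82],
         [35, 45, 17]]
stat = chi_square_statistic(table)
print(stat, p_value_df2(stat))
```

The closed-form tail probability is specific to 2 degrees of freedom; tables of other sizes need the general chi-squared distribution function.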

Fisher's Exact Test
This test is used when we have two nominal variables. A data set like this is often called an R x C table, where R is the number of rows and C is the number of columns. Fisher's exact test is more accurate than the Chi-Squared test of independence when the expected numbers are small. The most common use of Fisher's exact test is for 2 x 2 tables, but in our case we use a 2 x 3 table for each SNP. Fisher's exact test assumes that the row and column totals are fixed. In the much more common design, the row totals and/or column totals are free to vary. In this case, Fisher's exact test is not, strictly speaking, exact. It is still considered to be more accurate than the Chi-Square test, and we should feel comfortable using it for any test of independence with small numbers.
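A minimal sketch of an exact test for a 2 x 3 table follows. It enumerates every table with the same row and column totals, computes each table's probability under the multivariate hypergeometric distribution, and sums the probabilities of all tables that are no more likely than the observed one. This is one common definition of the p-value for R x C tables; whether the software used in the study applies exactly this rule is an assumption. The function names and example counts are hypothetical.

```python
from math import comb

def table_probability(first_row, col_totals):
    """Probability of a 2 x 3 table with fixed margins (multivariate hypergeometric)."""
    numerator = 1
    for count, col_total in zip(first_row, col_totals):
        numerator *= comb(col_total, count)
    return numerator / comb(sum(col_totals), sum(first_row))

def fisher_exact_2x3(table):
    """Sum the probabilities of all tables that are at most as likely as the observed one."""
    first_row = table[0]
    col_totals = [a + b for a, b in zip(*table)]
    r1 = sum(first_row)
    p_observed = table_probability(first_row, col_totals)
    p_value = 0.0
    for a in range(min(col_totals[0], r1) + 1):
        for b in range(min(col_totals[1], r1 - a) + 1):
            c = r1 - a - b
            if c > col_totals[2]:
                continue  # not a valid table for these margins
            p = table_probability((a, b, c), col_totals)
            if p <= p_observed * (1 + 1e-9):  # small tolerance for floating point ties
                p_value += p
    return p_value

# Hypothetical small table: rows are OCB positive / negative, columns are genotypes 0 / 1 / 2.
print(fisher_exact_2x3([[8, 3, 1], [2, 5, 4]]))
```

The enumeration grows quickly with the sample size, so this sketch is meant for small tables only.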

The Problem of Multiple Testing Correction
Any time you reject the null hypothesis because a p-value is less than your critical value, it is possible that you are wrong; the null hypothesis might really be true, and your significant result might be due to chance. A p-value of 0.05 means that there is a 5% chance of getting a result at least as extreme as the one observed if the null hypothesis were true. When you do multiple statistical tests, some fraction will therefore be false positives, a problem that has received increasing attention in the last few years. This is important for such techniques as the use of microarrays, which make it possible to measure RNA quantities for tens of thousands of genes at once (Mcdonald, 2001) [30]. Controlling for multiple testing to accurately estimate significance thresholds is a very important aspect of studies involving many genetic markers, particularly GWA studies. The type I error rate, also called the significance level or false-positive rate, is the probability of rejecting the null hypothesis when it is true. The significance level indicates the proportion of false positives that an investigator is willing to tolerate in his or her study. The family-wise error rate (FWER) is the probability of making one or more type I errors in a set of tests. Lower FWERs restrict the proportion of false positives at the expense of reducing the power to detect association when it truly exists. It is therefore important to keep track of the number of statistical comparisons performed and to correct the individual SNP-based significance thresholds for multiple testing to maintain the overall FWER (Zondervan et al, 2007) [31]. In order to choose an appropriate multiple testing method, it is critical to decide which kind of error is to be controlled. The following subsections introduce the common multiple testing correction methods.

Bonferroni Correction
The classical approach to the multiple comparison problem is to control the family-wise error rate (Bland et al, 1995) [32]. Instead of setting the critical p-value for significance, $\alpha$, to 0.05, a lower $\alpha$ is used. If the null hypothesis is true for all tests, the probability of getting at least one result that is significant at this new, lower level is at most 0.05. In other words, if the null hypotheses are all true, the probability that the family of tests includes one or more false positives due to chance is at most 0.05. The simplest and most common of the p-value-based procedures to control the family-wise error rate is the well-known Bonferroni procedure. The basic procedure is as follows: the significance level ($\alpha$) for an individual test is found by dividing the family-wise error rate (usually 0.05) by the number of tests. If we are doing 100 statistical tests, the $\alpha$ level for an individual test would be $0.05/100 = 0.0005$, and only individual tests with p-value < 0.0005 would be considered significant. The Bonferroni correction does not require the tests to be independent of each other: it controls the family-wise error rate under any dependence between the tests, although it can be conservative when the tests are correlated. An important issue with the Bonferroni correction is deciding what a "family" of statistical tests is. There is no firm rule on this; we have to use our judgement, based on just how bad a false positive would be.
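The Bonferroni rule can be written in two equivalent ways: compare each raw p-value with alpha divided by the number of tests, or multiply each p-value by the number of tests (capped at 1) and compare with alpha. A minimal sketch, with hypothetical function names and p-values:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

print(bonferroni_threshold(0.05, 100))  # per-test level for 100 tests
print(bonferroni_threshold(0.05, 923))  # per-test level if all 923 SNP tests form the family

# Hypothetical raw p-values.
raw = [0.0001, 0.004, 0.03, 0.2]
adjusted = bonferroni_adjust(raw)
print([p <= 0.05 for p in adjusted])
```

Which number of tests to divide by depends on what is taken as the family, which is the judgement call discussed above.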

False Discovery Rate
A different approach to multiple testing does not try to control the family-wise error rate (a probability computed under the assumption that all hypotheses are simultaneously true), but focuses instead on the proportion of falsely significant genes. As we will see, this approach has strong practical appeal and is an alternative to controlling the family-wise error rate. The false discovery rate is the proportion of "discoveries" (significant results) that are actually false positives. For example, suppose we are using a microarray to compare expression levels of 100,000 genes between liver tumors and normal liver cells. We are going to do additional experiments on any genes that show a significant difference between the normal and tumor cells, and we are willing to accept up to 10% of the genes with significant results being false positives; we find out that they are false positives when we do the follow-up experiments. In this case, we would set our false discovery rate to 10%. A good technique for controlling the false discovery rate was briefly mentioned by (Simes, 1986) [29] and developed in detail by (Benjamini and Hochberg, 1995) [33]. Let $V$ denote the number of false positives among the $R$ hypotheses that are called significant. The false discovery rate is defined as $\text{FDR} = E(V/R)$, with $V/R$ taken to be 0 when $R = 0$; that is, the expected proportion of genes that are incorrectly called significant, among the R genes that are called significant. The expectation is taken over the population from which the data are generated. The Benjamini and Hochberg (BH) procedure for $m$ hypotheses $H_{01}, \ldots, H_{0m}$ with p-values $p_1, \ldots, p_m$ is:
1. Fix the false discovery rate $Q$ and let $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$ denote the ordered p-values.
2. Define $L = \max\{ i : p_{(i)} \le (i/m)\, Q \}$.
3. Reject all hypotheses $H_{0j}$ for which $p_j \le p_{(L)}$, the BH rejection threshold.
If the hypotheses are independent, (Benjamini and Hochberg, 1995) [33] show that regardless of how many null hypotheses are true and regardless of the distribution of the p-values when the null hypothesis is false, this procedure has the property $\text{FDR} \le (m_0/m)\, Q \le Q$, where $m_0$ is the number of true null hypotheses.
That is, put the individual p-values in order, from smallest to largest. The smallest p-value has a rank of i=1, the next has i=2, etc. Then compare each individual p-value to its BH critical value, $(i/m)Q$, where $i$ is the rank, $m$ is the total number of tests and $Q$ is the chosen false discovery rate. The largest p-value that satisfies $p_{(i)} \le (i/m)Q$ is significant, and all p-values smaller than it are also significant.
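The ranking rule described above can be sketched in a few lines. The function name `benjamini_hochberg` and the example p-values are hypothetical; the comparison uses "less than or equal", as in the original Benjamini and Hochberg formulation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            largest_rank = rank  # remember the largest rank that passes its critical value
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected

# Hypothetical p-values: the second one fails its own critical value (0.03 > 0.025)
# but is still rejected because a larger p-value (0.035 <= 0.0375) passes.
print(benjamini_hochberg([0.001, 0.03, 0.035, 0.9], q=0.05))  # [True, True, True, False]
```

The example shows the step-up character of the procedure: significance is decided by the largest passing rank, not by each p-value's own critical value.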

Results
The main objective of our work is to determine SNPs and SNPs-SNPs interactions in the genomic TSAD region which explain the difference between two MS conditions: OCB positive vs. OCB negative using the data from the Oslo MS DNA Bio Bank and The Norwegian MS Registry and Bio Bank. Accordingly, the analysis is carried out in two different approaches or variable selection methods: Variable Selection Based on Lasso applied to Binary Logistic Regression analysis and Variable Selection Based on Test of Association.

Analysis I: Variable Selection Based on Lasso Method
Variable selection plays an important role in regression analysis and is intended to select the best subset of predictors. Here we use the Lasso (Least Absolute Shrinkage and Selection Operator), a relatively recent method of variable selection that can be applied when the number of samples is smaller than the number of parameters (variables). The Lasso shrinks some coefficients to 0 by a constraint on the sum of absolute values of the regression coefficients (the penalty), so it can serve as a tool for variable selection. Such solutions, with multiple values that are identically zero, are said to be sparse. The penalty thereby performs a sort of continuous variable selection. In the following section, we apply the Lasso to binary logistic regression and then examine interaction effects among the selected SNPs.

Lasso for Logistic Regression Analysis
Here we apply the Lasso variable selection to binary logistic regression on 923 SNPs. We ran the Lasso 120 times, because each run of cross validation uses different random folds. The Lasso selects four SNPs at the optimum lambda (i.e., the lambda obtained from lambda.min by the one-standard-error rule) with ten-fold and five-fold cross validation, using the glmnet package in R. These are the most frequently selected SNPs, and they are the ones we retained in the end (see the Lasso Output for 923 SNPs in the Appendix Section). Finally, we fitted the model containing the selected SNPs to our dataset (see the fitted equation below). Table 3 shows the four Lasso-selected SNPs with the corresponding regression coefficients and chromosome positions. We then plotted the corresponding cross validation curve for our dataset (see Figure 3).
$$\log\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = 0.00098 \cdot \text{NA} + 0.08 \cdot \text{NA} - 0.0261 \cdot \text{TEK} - 0.0894 \cdot \text{EGFR} \qquad (8)$$
where $\hat{\pi}$ is a conditional probability of the form $P(Y = \text{OCB} \mid \text{SNP}_1, \ldots, \text{SNP}_{923})$. That is, it is assumed that success, or Y = OCB, is more or less likely depending on combinations of values of the SNPs or predictor variables.
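The choice of lambda by the one-standard-error rule can be sketched as follows: find the lambda with the smallest mean cross-validated error, then take the largest lambda whose mean error is within one standard error of that minimum. This mirrors the documented meaning of lambda.min and lambda.1se in glmnet, but the function below is a hypothetical re-implementation in Python, and the error values are invented for illustration.

```python
def one_standard_error_lambda(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validated mean errors and standard errors."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    limit = cv_mean[best] + cv_se[best]
    # Larger lambda means more shrinkage, so take the largest lambda within the limit.
    candidates = [lam for lam, err in zip(lambdas, cv_mean) if err <= limit]
    return lambdas[best], max(candidates)

# Hypothetical cross validation results for four candidate values of lambda.
lambdas = [0.2, 0.1, 0.05, 0.01]
cv_mean = [0.70, 0.66, 0.64, 0.65]
cv_se = [0.03, 0.03, 0.03, 0.03]
print(one_standard_error_lambda(lambdas, cv_mean, cv_se))  # (0.05, 0.1)
```

The one-standard-error choice favors a sparser model than lambda.min, which fits the goal of selecting a small set of SNPs.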

Interaction Effect for the Lasso Selected SNPs
In this section, we study interaction effects using ten variables: the four SNPs selected by the Lasso for logistic regression on the 923 SNPs (see Table 3) and the six pairwise interactions formed from these four SNPs. We then ran the Lasso for logistic regression on these 10 variables and found that none of the interaction variables was selected (see the Lasso Output for 10 Variables in the Appendix Section).
Journal of Medicine, Physiology and Biophysics www.iiste.org ISSN 2422-8427 An International Peer-reviewed Journal Vol.61, 2019

Analysis II: Variable Selection Based on Test of Association
In this section, we use a different approach to select variables. First, we identify important variables using tests of association, namely the Chi-Square test and Fisher's exact test. Second, because we test many hypotheses, we apply multiple testing corrections to control the family-wise error rate or the false discovery rate for the hypotheses from Fisher's exact test; here we use the FDR and Bonferroni adjustment methods for the 38 raw (unadjusted) P-values obtained by Fisher's exact test. Finally, we identify interaction effects among the SNPs selected by Fisher's exact test.

Fisher's Exact and Chi-Square Test of Association
We conducted 923 separate tests, one for each SNP. The Chi-Square test of association selected 34 significant SNPs and Fisher's exact test selected 38 significant SNPs at a significance level of 0.05 (see Table 4 and Table 5, respectively).
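For a single SNP, the two tests can be sketched as below. This is a simplified illustration, not the study's code: it assumes each SNP is reduced to a 2x2 table (OCB status by carrier status), whereas the actual genotype tables may have more columns, and the counts in the example are invented (only the row totals 802 and 97 come from the study). The function names are hypothetical, and the Chi-Square version uses no continuity correction.

```python
import math


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= observed * (1 + 1e-9))


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 df, no continuity correction)."""
    (a, b), (c, d) = table
    total = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = rows[i] * cols[j] / total
            stat += (obs - expected) ** 2 / expected
    # For 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows are OCB positive / OCB negative,
# columns are carriers / non-carriers of the minor allele.
table = [[300, 502], [50, 47]]
p_fisher = fisher_exact_two_sided(table)
chi2_stat, p_chi2 = chi_square_2x2(table)
print(p_fisher, chi2_stat, p_chi2)
```

Repeating this for each of the 923 SNPs and keeping those with a P-value below 0.05 gives the kind of unadjusted selection reported above.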

Bonferroni and FDR for Multiple Testing Correction
At an FDR of 0.05, the Benjamini and Hochberg (BH) adjustment shows that all 38 SNPs are significant; that is, all 38 BH-adjusted P-values are below 0.05. In contrast, at a family-wise error rate of 0.05, the Bonferroni adjustment shows that none of the SNPs is significant; that is, none of the Bonferroni-adjusted P-values is below 0.05 (see Table 6).
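The contrast between the two corrections can be reproduced with a short sketch. The 38 raw P-values below are invented (the real ones are in Table 6); they are chosen only so that every value lies below 0.05 but above 0.05/38. The function names are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg_adjust(p_values):
    """BH-adjusted p-values (step-up procedure), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values: 38 values spread evenly from 0.002 to 0.049.
raw = [0.002 + 0.047 * i / 37 for i in range(38)]
bonf = bonferroni_adjust(raw)
bh = benjamini_hochberg_adjust(raw)
print(sum(p < 0.05 for p in bonf))  # 0 significant after Bonferroni
print(sum(p < 0.05 for p in bh))    # 38 significant after BH
```

When the correction is applied only to P-values that are already below 0.05, the largest BH-adjusted value equals the largest raw value, so BH retains all of them, whereas Bonferroni requires a raw P-value below 0.05/38 (about 0.0013).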

Interaction Effects of SNPs Found to Be Significant by Test of Association
Here we study interaction effects among the 38 SNPs found to have a significant association by Fisher's exact test. These 38 SNPs give 703 pairwise interactions, so together with the 38 main effects there are 741 candidate variables. We performed variable selection among these 741 variables, again using the Lasso with binary logistic regression. Seven SNP-SNP interactions were selected by our method and none of the main effects was selected (see the Lasso output for 741 variables in the Appendix). The seven interaction variables are listed in Table 7, and the corresponding cross-validation curve is shown in Figure 4. Finally, we fitted the model containing only the seven interaction effects to our data (see Equation 12 below):

\[ \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -0.020\,(a \cdot b) - 0.054\,(c \cdot d) - 0.001\,(f \cdot g) - 0.06\,(f \cdot b) - 0.018\,(c \cdot e) - 0.051\,(c \cdot e) - 0.0067\,(e \cdot e) \qquad (9) \]

where a = MYO7B, b = EGFR, c = NA, d = ITK, f = FYN, g = NA and e = TEK, and \hat{\pi} is a conditional probability of the form P(Y = OCB | SNP_1, ..., SNP_923) or P(Y = OCB | SNP_{1,2}, ..., SNP_{922,923}). That is, it is assumed that success, Y = OCB, is more or less likely depending on the combination of values of the SNPs or of the SNP pairs.
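To read the fitted interaction model, the log-odds on the left-hand side can be converted back to a probability with the inverse logit. The sketch below uses only two of the seven printed interaction terms and no intercept, so the resulting number is illustrative and is not a prediction from the study's model. The function name, the patient's genotype values and the 0/1/2 coding are assumptions.

```python
import math


def probability_from_log_odds(coefficients, genotypes, intercept=0.0):
    """Convert a linear predictor on the logit scale to a probability.

    `coefficients` maps (snp_1, snp_2) pairs to interaction coefficients;
    `genotypes` maps SNP names to one patient's genotype codes.
    """
    log_odds = intercept
    for (first, second), coef in coefficients.items():
        log_odds += coef * genotypes[first] * genotypes[second]
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two of the seven interaction terms from the fitted model above; the other
# terms and any intercept are omitted, so the output is illustrative only.
coefficients = {("MYO7B", "EGFR"): -0.020, ("FYN", "EGFR"): -0.06}
# Hypothetical patient, assuming genotypes coded as minor-allele counts.
patient = {"MYO7B": 1, "EGFR": 2, "FYN": 1}
print(probability_from_log_odds(coefficients, patient))
```

Because all printed coefficients are negative, larger genotype products lower the log-odds and therefore the fitted probability, under the coding assumed here.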

Discussion
The results from analysis I, using our proposed framework (variable selection based on the Lasso method) to determine important SNPs in the genomic TSAD region, identified four SNPs that appear to be associated with the Oligoclonal Band (OCB) subpopulation of multiple sclerosis patients (see Table 3). We also tested for interaction effects by combining these four SNPs, and the analysis showed that none of the combinations (interactions) has an effect on the OCB subpopulation of multiple sclerosis patients. This finding suggests that these four SNPs independently determine OCB status in multiple sclerosis patients, based on our study of the Norwegian cohort. Then, the final model [...] (see Table 5). The results from analysis II for the Chi-Square test showed 34 important SNPs at the 0.05 significance level (see Table 4). The results from analysis II for the Benjamini and Hochberg (BH) procedure at an FDR of 0.05 showed that all 38 SNPs are important for determining the OCB subpopulation of multiple sclerosis patients (see Table 6). In contrast, the Bonferroni correction at a family-wise error rate of 0.05 showed that none of the SNPs is significant. The final results from analysis II, for interaction effects among the variables obtained by combining the 38 SNPs previously selected by Fisher's exact test, showed seven important SNP-SNP interactions which determine the OCB subpopulation of multiple sclerosis patients (see Table 7). Our study is consistent with similar studies that identified potentially interesting logic expressions representing SNP interactions, together with measures for quantifying the importance of these features for classification in case-control studies (Holger, 2008) [34].
A similar study indicated that many common diseases are influenced by interactions of certain genes, and that quadratic penalization not only correctly characterizes the influential genes along with their interaction structures but also yields additional benefits in handling high-dimensional, discrete factors with a binary response (Young, 2008) [35]. Another similar analysis indicated that penalizing the size of the coefficients is a common strategy for robust modeling in regression and classification with high-dimensional data, and examined the properties of the Lasso constraints applied to the coefficients in generalized linear models (GLM) for the specific application of modeling gene interactions (Young, 2006) [36]. The experiments reported in (Yoav, 2002) [27] aimed to identify genes with altered expression in the livers of mice with very low cholesterol levels compared to inbred control mice. They examined the p-values obtained directly from the raw t-statistics with 14 degrees of freedom. The Bonferroni adjustment pointed to eight rejections. Applying the FDR-controlling BH procedure to the raw p-values, they arrived at the same eight genes identified as differentially expressed in the original analysis.

Conclusions and Recommendations

Conclusions
The Lasso method selects four important SNPs that appear to be associated with the Oligoclonal Band (OCB) subpopulation of multiple sclerosis patients. These are the most frequently selected SNPs, and they are the ones we retained in the end.
We also fitted the model containing the SNPs selected by our method to our dataset (see Equation 9). From the results of analysis I, we found that none of the interaction variables was selected, and the findings further suggest that the four SNPs independently determine OCB status in multiple sclerosis patients. The results from analysis II showed that 38 of the 923 SNPs in the genomic TSAD region are associated with the OCB subpopulation of multiple sclerosis patients. After Bonferroni correction, none of the SNPs is significant at the 0.05 level. At an FDR of 0.05, all 38 SNPs remain important for determining the OCB subpopulation of multiple sclerosis patients. Finally, analysis II showed seven important SNP-SNP interactions which determine the OCB subpopulation of multiple sclerosis patients, based on our study of the Norwegian cohort. We then collected all SNPs, with their locations, that were selected by any of the methods we used (various hypothesis tests and regressions); see Table 8. This is the list of SNPs that we think should be studied further and validated on a new dataset. Of particular importance are the seven SNP-SNP interactions that we found (see Table 7). It is the first time a SNP-SNP study has been performed on these data, and the findings will be communicated to Norwegian molecular biologists for follow-up.

Recommendations
The findings of this study have important implications for biologists, health organizations, researchers and scientists dealing with disease prevalence and progression, since genes are important factors in the cause of MS. As a new study in this area, it may raise awareness of the disease among interested groups and professionals, and may encourage them to make their own contributions in the same area of research. The Lasso selects the SNPs that are best for classifying the outcome of a new patient; these are SNPs that each carry additional independent information for the classification. This means that two covariates which are both very strongly associated with the outcome and are also highly correlated with each other will not both be selected by the Lasso, because only one covariate contributes to the best classification while the other carries the same information. The Bonferroni and FDR corrections do not consider the correlation between SNPs, but only how many tests were performed, and they reduce the significance level so that we make fewer false positive errors. We have therefore generated a series of biological hypotheses, supported by our stringent data analysis, which now need to be confirmed in a new population. If they are confirmed, it is possible to imagine that finding these SNPs in MS patients will allow better therapy. Our results therefore need to be further validated.