Estimation of AUC for Assessing Its Significance in Classification Models

The performance of a diagnostic test whose results are measured on a continuous scale can be evaluated using sensitivity and specificity over the range of possible cut-off points for the predictor variable. This is achieved with a receiver operating characteristic (ROC) curve, which is a graph of sensitivity against 1-specificity across all possible decision cut-off values of a diagnostic test result. The curve evaluates the ability of a test to discriminate the true state of subjects, especially in classification models. The predictive accuracy of classification models is usually better assessed using a summary measure of accuracy across all possible cut-off values, called the area under the receiver operating characteristic curve (AUC). In this paper, we propose a simple nonparametric method of calculating AUC from the predicted probability of positive response involving multiple prediction rules. The method is based on the nonparametric Mann-Whitney U statistic. Based on the predicted and observed outcomes, the performance of diagnostic tests is assessed for the classification models through the AUC calculated from these outcomes. The proposed method is applied to real data, and the significance of the AUC for the classification models is assessed. The method offers reliable statistical inferences and circumvents the difficulties of deriving the statistical moments of complex summary statistics seen in the parametric method. The proposed nonparametric method is recommended for calculating the AUC as it compares favorably with the existing parametric and nonparametric methods.


INTRODUCTION
In the medical sciences, diagnostic procedures are based on clinical investigations, laboratory experiments or trials whose purpose is to classify subjects as diseased or non-diseased. These procedures support vital decision making, aided by advanced machines and tools for detecting a given condition. For decades, receiver operating characteristic (ROC) curve analysis has been a popular technique for evaluating the ability of a test to discriminate between alternative health states. The ROC curve is a graph of sensitivity against 1-specificity across various cut-off values of a diagnostic test. It assesses the effectiveness of continuous diagnostic test results in differentiating between groups of healthy and diseased individuals (Greiner et al., 2000; Zhou et al., 2002; Pepe, 2004). It is also a common tool for assessing the performance of various classification tools such as biological test results, diagnostic tests and statistical distributional models, and for assessing accuracy quantitatively or comparing accuracy between tests or predictive models. The ROC curve originated in the theory of signal detection in the years 1950-1960 (Green and Swets, 1966; Egan, 1975) as a means of discriminating between signal and noise. It has been used in many areas such as radiology (Metz, 1989), psychiatry (Hsiao et al, 1989), epidemiology (Aoki et al, 1997) and biomedical informatics (Lasko et al, 2005). It can provide a direct and visual comparison of two or more diagnostic tests on a single set of scales, and constructing the ROC curves makes it possible to compare different tests at all decision cut-offs. For statistical analysis, a numerical index of accuracy associated with an ROC curve is often used to summarize the information provided by the curve in a single global value (Swets and Picket, 1982). This index is called the area under the ROC curve (AUC).
AUC takes values between 0.5 (which corresponds to the diagonal ROC curve passing through the points (0,0) and (1,1)) and 1 (representing a perfect test in which all cases are correctly classified). AUC represents the diagnostic accuracy of the test Y, so that the larger the area, the better the diagnostic accuracy of Y. Values closer to 1 indicate that Y discriminates well between healthy and diseased subjects, while values near 0.5 indicate that the test is not informative (Zhou et al, 2002). According to Mann and Whitney (1947), AUC is the probability that the observed test result of a randomly selected subject from the diseased population ($Y_1$) is larger than the observed test result of a randomly selected subject from the non-diseased population ($Y_0$), that is, $\mathrm{AUC} = P(Y_1 > Y_0)$. [...] correlation between the test results of the same subjects. This paper is devoted to reviewing some existing methods of calculating AUC. It attempts to calculate AUC based on a simple new method and to evaluate its significance in assessing classification models.

EXISTING METHODS OF CALCULATING AUC
There are several methods of calculating the AUC. These methods differ in how the distribution functions of the two populations are estimated from their sample values. Lopez-Raton et al (2012a) review these methods of estimating ROC curves and AUCs. The two basic methods of estimating AUC are the parametric (binormal) method and the non-parametric (empirical) method.

Parametric (Binormal ROC curve) method
This estimation is based on the assumption that the test results in the diseased and non-diseased populations, or some unknown monotonic transformation of the test results, follow a binormal distribution. Alf, Jr (1968, 1969) and McClish (1989) proposed maximum-likelihood estimates (MLEs) for the parameters of a binormal ROC curve and provided parametric methods for estimating and comparing the partial AUC, respectively.
Following Alf (1968), McClish (1989) and Metz (1978), suppose that continuous diagnostic test results are normally distributed in the healthy and diseased populations. To define the ROC curve, let Y denote a random variable representing a continuous diagnostic test result. The diagnosis according to any cut-off value c is positive (diseased) if Y ≥ c and negative (non-diseased) if Y < c. Let D0 and D1 denote the non-diseased and diseased populations, respectively. The true positive rate, TPR(c), and false positive rate, FPR(c), at the cut-off value c are
$$\mathrm{TPR}(c) = P(Y \ge c \mid D_1), \qquad \mathrm{FPR}(c) = P(Y \ge c \mid D_0),$$
with $Y \mid D_1 \sim N(\mu_1, \sigma_1^2)$ and $Y \mid D_0 \sim N(\mu_0, \sigma_0^2)$, where $\mu_1, \mu_0$ and $\sigma_1^2, \sigma_0^2$ are the population means and variances of the diseased and non-diseased groups.
For any cut-off value c, we have
$$\mathrm{TPR}(c) = \Phi\!\left(\frac{\mu_1 - c}{\sigma_1}\right), \qquad \mathrm{FPR}(c) = \Phi\!\left(\frac{\mu_0 - c}{\sigma_0}\right), \qquad (1.13)$$
where $\Phi$ denotes the standard normal cumulative distribution function. Note also that t denotes the possible FPRs as c varies in (−∞, ∞). Setting $t = \mathrm{FPR}(c)$ and simplifying, we have
$$c = \mu_0 - \sigma_0\,\Phi^{-1}(t) \qquad (1.14)$$
as the corresponding cut-off value. Hence the ROC curve is obtained by substituting for c in equation 1.13 to have the function
$$\mathrm{ROC}(t) = \Phi\!\left(a + b\,\Phi^{-1}(t)\right), \qquad a = \frac{\mu_1 - \mu_0}{\sigma_1}, \quad b = \frac{\sigma_0}{\sigma_1}, \quad 0 \le t \le 1.$$
Following McClish (1989) and Metz (1978), we define $y_1 = \ln Y_1$ and $y_0 = \ln Y_0$ for the diseased and non-diseased groups respectively, since it is easier to work with the normal distribution than with the lognormal distribution.
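The binormal construction above can be illustrated numerically. The following is a minimal Python sketch, not the authors' implementation: the function names are hypothetical, the parameter values in the usage note are made up, and the closed-form area $\Phi(a/\sqrt{1+b^2})$ is the standard binormal result, checked here against a simple midpoint-rule integral of the curve.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def binormal_roc(t, mu1, sigma1, mu0, sigma0):
    """Sensitivity at false positive rate t (0 < t < 1) under the binormal model."""
    a = (mu1 - mu0) / sigma1
    b = sigma0 / sigma1
    return STD_NORMAL.cdf(a + b * STD_NORMAL.inv_cdf(t))


def binormal_auc(mu1, sigma1, mu0, sigma0):
    """Closed-form area under the binormal ROC curve."""
    a = (mu1 - mu0) / sigma1
    b = sigma0 / sigma1
    return STD_NORMAL.cdf(a / (1 + b * b) ** 0.5)


def numeric_auc(mu1, sigma1, mu0, sigma0, steps=20000):
    """Midpoint-rule integral of ROC(t) over (0, 1), used only as a cross-check."""
    h = 1.0 / steps
    return h * sum(
        binormal_roc((k + 0.5) * h, mu1, sigma1, mu0, sigma0) for k in range(steps)
    )
```

For example, `binormal_auc(1.0, 1.0, 0.0, 1.0)` and `numeric_auc(1.0, 1.0, 0.0, 1.0)` should agree to about three decimal places, and equal means give an AUC of 0.5.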

Alternative method of calculating AUC.
Based on the binormal assumption, we know that
$$\mathrm{AUC} = \Phi\!\left(\frac{a}{\sqrt{1 + b^2}}\right),$$
where $\Phi$ is the cumulative standard normal distribution function (Pepe, 2003; Metz, 1986; Zou et al, 1997; Hanley and McNeil, 1982).

Estimating Variance of AUC
The variance of $\widehat{\mathrm{AUC}}$ depends on $n_1$ and $n_0$, the numbers of diseased and non-diseased study subjects, respectively, and can be estimated by substituting estimators for the parameters $a_1$ and $a_0$. To estimate the variance of AUC, recall that
$$\mathrm{AUC} = \Phi(\delta), \qquad \delta = \frac{a}{\sqrt{1 + b^2}} = \frac{\mu_1 - \mu_0}{\sqrt{\sigma_1^2 + \sigma_0^2}}.$$
The maximum likelihood estimate of $\delta$ is obtained by substituting the MLEs of the means and variances. The MLE of AUC is found by substituting the MLEs of the means and variances into equation 1.20 and using numerical integration. Since $\mathrm{AUC} = \Phi(\delta)$ is a monotonic increasing function of $\delta$, it is enough to find the variance and standard error of $\hat{\delta}$.

Since $\delta$ is a function of the parameters $\mu_1, \mu_0, \sigma_1^2$ and $\sigma_0^2$, we adopt the Delta method (Zhou et al, 2002) for finding the approximate variance and standard error of $\hat{\delta}$. A $100(1-\alpha)\%$ confidence interval for the AUC then follows as
$$\Phi\!\left(\hat{\delta} \pm Z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\delta})\right),$$
where $\alpha$ is the level of significance and $Z_{\alpha/2}$ is the critical value of Z for a two-tailed test at level of significance $\alpha$.

Some methods using the empirical ROC curve are the trapezoidal rule to approximate AUC by integration, the Mann-Whitney statistic and the empirical method by DeLong et al. (1988).
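For the parametric (binormal) estimate discussed in this subsection, the Delta-method step can be sketched in Python as below. This is an illustration under stated assumptions rather than the paper's own variance expression: it uses unbiased sample means and variances in place of the MLEs, treats the four estimates as independent with Var(mean) = σ²/n and Var(sample variance) = 2σ⁴/(n − 1) under normality, and the function name `binormal_auc_ci` is hypothetical.

```python
from statistics import NormalDist, mean, variance

STD_NORMAL = NormalDist()


def binormal_auc_ci(diseased, healthy, level=0.95):
    """Binormal AUC = Phi(delta) with a Delta-method confidence interval (a sketch)."""
    n1, n0 = len(diseased), len(healthy)
    m1, m0 = mean(diseased), mean(healthy)
    v1, v0 = variance(diseased), variance(healthy)  # sample variances (n - 1 divisor)
    total = v1 + v0
    delta = (m1 - m0) / total ** 0.5
    # Gradient of delta: +/- total^(-1/2) w.r.t. the two means, and
    # -(m1 - m0) / (2 * total^(3/2)) w.r.t. each variance.
    var_delta = (v1 / n1 + v0 / n0) / total + (m1 - m0) ** 2 / (4 * total ** 3) * (
        2 * v1 ** 2 / (n1 - 1) + 2 * v0 ** 2 / (n0 - 1)
    )
    se = var_delta ** 0.5
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)  # two-sided critical value Z_{alpha/2}
    auc = STD_NORMAL.cdf(delta)
    # Phi is monotone, so the interval for delta maps directly to one for the AUC.
    return auc, STD_NORMAL.cdf(delta - z * se), STD_NORMAL.cdf(delta + z * se)
```

Because the interval is built on the δ scale and then transformed, its limits always stay inside [0, 1].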

Nonparametric (conventional) Delong et al method
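The empirical (nonparametric) method by DeLong et al. (1988) is a popular and well-known method for comparing two correlated AUCs using the theory of generalized U statistics. They used the structural components method provided by Sen (1960) to generate consistent variance estimates of the elements of the variance-covariance matrix of a vector of U statistics, and the resulting test statistic has asymptotically a $\chi^2$ distribution. This method is important as it helps to study the behavior of the type I error and the statistical power of the conventional nonparametric test for comparing two AUCs over a wide range of relevant parameters and against various alternatives. Following DeLong et al. (1988), let $Y_{1i}$, $i = 1, \ldots, n_1$, be the test results of the diseased subjects and $Y_{0j}$, $j = 1, \ldots, n_0$, those of the healthy subjects, and let
$$\psi(Y_{1i}, Y_{0j}) = \begin{cases} 1 & \text{if } Y_{1i} > Y_{0j}, \\ \tfrac{1}{2} & \text{if } Y_{1i} = Y_{0j}, \\ 0 & \text{if } Y_{1i} < Y_{0j}. \end{cases}$$
With
$$V_{10}(Y_{1i}) = \frac{1}{n_0}\sum_{j=1}^{n_0} \psi(Y_{1i}, Y_{0j})$$
being the component of the ith subject from the diseased population and
$$V_{01}(Y_{0j}) = \frac{1}{n_1}\sum_{i=1}^{n_1} \psi(Y_{1i}, Y_{0j})$$
being the component of the jth subject from the healthy population, the empirical AUC is estimated as
$$\widehat{\mathrm{AUC}} = \frac{1}{n_1 n_0}\sum_{i=1}^{n_1}\sum_{j=1}^{n_0} \psi(Y_{1i}, Y_{0j}).$$
The variance of the estimated AUC is estimated as
$$\widehat{\mathrm{Var}}(\widehat{\mathrm{AUC}}) = \frac{S_{10}}{n_1} + \frac{S_{01}}{n_0},$$
where $S_{10}$ and $S_{01}$, the variances of the components of the diseased and non-diseased subjects, are defined as
$$S_{10} = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}\left(V_{10}(Y_{1i}) - \widehat{\mathrm{AUC}}\right)^2, \qquad S_{01} = \frac{1}{n_0 - 1}\sum_{j=1}^{n_0}\left(V_{01}(Y_{0j}) - \widehat{\mathrm{AUC}}\right)^2.$$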
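The structural-components calculation for a single AUC can be sketched in Python as follows. This is a minimal illustration assuming the standard DeLong et al. (1988) definitions for one test; the function name and the toy data in the usage note are hypothetical, and the covariance terms needed for comparing two correlated AUCs are not covered.

```python
def delong_auc_variance(diseased, healthy):
    """Empirical AUC and its structural-components variance for one test."""
    n1, n0 = len(diseased), len(healthy)

    def psi(y1, y0):
        # Kernel of the U statistic: 1 for a correctly ordered pair, 1/2 for a tie.
        if y1 > y0:
            return 1.0
        if y1 == y0:
            return 0.5
        return 0.0

    # One structural component per diseased subject and per healthy subject.
    v10 = [sum(psi(y1, y0) for y0 in healthy) / n0 for y1 in diseased]
    v01 = [sum(psi(y1, y0) for y1 in diseased) / n1 for y0 in healthy]

    auc = sum(v10) / n1
    s10 = sum((v - auc) ** 2 for v in v10) / (n1 - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n0 - 1)
    return auc, s10 / n1 + s01 / n0
```

With the toy samples `[3, 4, 5]` (diseased) and `[1, 2, 4]` (healthy), 7.5 of the 9 pairs are correctly ordered (counting the tie as one half), so the AUC is 7.5/9 and the variance estimate works out to 1/27.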

THE TRAPEZOIDAL RULE AND MANN-WHITNEY U STATISTIC METHODS
To calculate the AUC using the trapezoidal rule, the ROC curve is first separated into segments by plotting the points (1-specificity, sensitivity) obtained at each value of the continuous test result and dropping a straight line from each point to the x-axis. This forms several trapezoids, and the AUC is calculated directly by summing the areas of the trapezoids formed below the connected points making up the ROC curve. According to Bamber (1975) and Hanley and McNeil (1982), the area under an empirical ROC curve, when calculated by the trapezoidal rule, is equal to the Mann-Whitney two-sample statistic applied to the two samples; the Mann-Whitney U test, synonymously the Wilcoxon rank-sum test, is the nonparametric analog of the t-test. Here, the possible diagnostic test results for each cut-off value c are considered, and the corresponding true positive and false positive rates are
$$\mathrm{TPR}(c) = \frac{s_1(c)}{n_1}, \qquad \mathrm{FPR}(c) = \frac{s_0(c)}{n_0},$$
where $s_1(c)$ is the number of subjects with test results greater than or equal to c (Y ≥ c) among the $n_1$ diseased subjects and $s_0(c)$ is the number of subjects with test results greater than or equal to c (Y ≥ c) among the $n_0$ non-diseased subjects. The ROC curve is subsequently created by connecting these points with straight lines (Bamber, 1975; Hanley and McNeil, 1982). The AUC of the nonparametric ROC curve is obtained using the trapezoidal rule and is estimated by
$$\widehat{\mathrm{AUC}} = \sum_{k} \tfrac{1}{2}\left[\mathrm{TPR}(c_k) + \mathrm{TPR}(c_{k+1})\right]\left[\mathrm{FPR}(c_{k+1}) - \mathrm{FPR}(c_k)\right],$$
where the cut-offs $c_k$ are ordered so that FPR is non-decreasing. Since the Mann-Whitney U-statistic is based upon an estimate of $P(Y_1 > Y_0)$, which is exactly the AUC, the properties of the Mann-Whitney U-statistic can be used to predict the statistical properties of the AUC (Hanley and McNeil, 1982).
The variance of the estimated AUC is computed using the Mann-Whitney statistic (Hanley and McNeil, 1982) as
$$\widehat{\mathrm{Var}}(\widehat{\mathrm{AUC}}) = \frac{\widehat{\mathrm{AUC}}\,(1 - \widehat{\mathrm{AUC}}) + (n_1 - 1)\,(Q_1 - \widehat{\mathrm{AUC}}^2) + (n_0 - 1)\,(Q_2 - \widehat{\mathrm{AUC}}^2)}{n_1 n_0},$$
where $n_1$ and $n_0$ are the numbers of diseased and non-diseased subjects, $Q_1$ is the probability that two randomly chosen diseased subjects both have higher test results than a randomly chosen non-diseased subject, and $Q_2$ is the probability that one randomly chosen diseased subject has a higher test result than two randomly chosen non-diseased subjects. $Q_1$ and $Q_2$ are estimated from the following counts at each distinct test result y: $n_{0,=y}$ is the number of true negative subjects with test results equal to y, $n_{1,=y}$ is the number of true positive subjects with test results equal to y, $n_{0,<y}$ is the number of true negative subjects with test results less than y, and $n_{1,>y}$ is the number of true positive subjects with test results greater than y (Hanley and McNeil, 1982). The trapezoidal approach systematically underestimates the AUC because the points on the ROC curve are connected with straight lines rather than a smooth concave curve (Zhou et al, 2002). Increasing the number of possible cut-off points can significantly reduce this bias and make the estimate acceptable. Hanley and McNeil (1983) showed that the area computed by the trapezoidal rule under an empirical ROC curve is equal to the Mann-Whitney U statistic, which they used for comparing correlated AUCs from two samples. In general, the interpretation of the AUC is the same regardless of whether the trapezoidal rule or the Mann-Whitney U statistic is used. One way to compare the trapezoidal rule and the Mann-Whitney U statistic in estimating AUC is to compare their respective AUCs. To determine whether two AUCs are significantly different, the variances of both AUC estimates must be taken into account.
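The equality between the trapezoidal area under the empirical ROC curve and the Mann-Whitney estimate can be checked directly. The Python sketch below is illustrative only: the function names and toy data are hypothetical, ties are counted as one half, and every distinct observed test result is used as a cut-off.

```python
def mann_whitney_auc(diseased, healthy):
    """Proportion of (diseased, healthy) pairs ordered correctly; ties count 1/2."""
    total = 0.0
    for y1 in diseased:
        for y0 in healthy:
            total += 1.0 if y1 > y0 else 0.5 if y1 == y0 else 0.0
    return total / (len(diseased) * len(healthy))


def trapezoidal_auc(diseased, healthy):
    """Trapezoidal area under the empirical ROC curve built from all cut-offs."""
    n1, n0 = len(diseased), len(healthy)
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]  # (FPR, TPR) for a cut-off above every observed result
    for c in cutoffs:
        tpr = sum(y >= c for y in diseased) / n1
        fpr = sum(y >= c for y in healthy) / n0
        points.append((fpr, tpr))
    area = 0.0
    for (x_left, y_left), (x_right, y_right) in zip(points, points[1:]):
        area += (x_right - x_left) * (y_left + y_right) / 2
    return area
```

On any pair of samples the two functions should return the same value up to floating-point rounding, for example `[3, 4, 5]` against `[1, 2, 4]`.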

CLASSIFICATION MODELS ASSESSMENT
The classification models discussed here include logistic regression, discriminant analysis and dummy variable multiple regression, chosen because of their similarity in models and estimation techniques. Classification modeling seeks a mathematical relationship between a response, or "dependent", variable and "independent" variables, so that future values of the response variable can be predicted from the estimated effects of the independent variables. We shall verify the significance of AUC in assessing classification models. For instance, based on the previous works by Okeh and Oyeka (2014, 2015, 2016) on dummy variable regression analysis, the predicted probability of positive response can be obtained and applied in calculating AUC for assessing the dummy variable regression model. This assessment is based on the p-value of the calculated AUC. Much has been said about some of these models: Ogum (2003) said that discriminant analysis is a rather powerful statistical tool when many variables (e.g. patients' risk factors for diabetes) are to be considered simultaneously, while Onyeagu (2003) viewed discriminant analysis as a technique concerned with the problem of classification, in the sense that the output generated from these classification models belongs to a certain range of values defined by the cut-off value c. This cut-off value, if known or assumed, can always be applied as the basis on which classification into the groups positive or diseased (coded 1) and negative or non-diseased (coded 0) is made. Applying the cut-off value c dichotomizes a continuous or discrete output into the values 1 and 0, so that the output $y(x)$ is set to 1 if the model output is at least c and to 0 otherwise.

METHODOLOGY
Here we propose a simpler method of calculating AUC from the predicted probability of positive response obtained from classification models. We shall also propose an easy-to-understand method of comparing two or more diagnostic tests in terms of their AUCs. To define the AUC, this paper adopts the Mann-Whitney (MW, 1947) statistic approach, calculating AUC from the predicted probability of positive response from classification models and the observed outcomes. In particular, the observed outcome of each subject is coded 1 or 0 (Equation 2.1), and the predicted outcome is obtained by dichotomizing the predicted probability of positive response with a prediction rule indexed by α, where 0 ≤ α ≤ 1 (Equation 2.2), given the fact that the graph of the ROC curve lies within this probability range. Equation 2.2 implies that higher values of α mean that more predicted outcome values will be 1, while lower values of α indicate that more predicted outcome values will be zero.
Vol.9, No.9, 2019

CONSTRUCTING ROC CURVE BASED ON THE α PREDICTION RULES
To construct the ROC curve, let Y be a random variable representing a continuous diagnostic test result. To dichotomize these test results, we assume that there exists a cut-off value c such that subjects are classified as diseased (coded 1) if Y ≥ c and as non-diseased (coded 0) if Y < c. The observed and predicted outcomes are then cross-classified in a 2 × 2 contingency table with cell counts a, b, c and d (Table 1; here c denotes a cell count rather than the cut-off value), in which a + c subjects have positive observed outcomes and n = a + b + c + d is the total number of subjects randomly sampled from the population.
Based on the results in classification Table 1 for each model, we carry out ROC curve analysis to measure the accuracy of the diagnostic test in discriminating between alternative health states by first calculating the sensitivity and 1-specificity, since the ROC curve is the graph obtained by plotting these values. By varying the value of α in Equation 2.2 between 0 and 1 inclusive, we generate many sets of predicted outcomes, each represented in its corresponding contingency table for each model. From these tables, we calculate the values of sensitivity and 1-specificity used in constructing the ROC curve. Specifically, each α-rule contributes one pair of sensitivity and 1-specificity values, that is, one point on the ROC curve, so several α-rules are used in constructing a smooth ROC curve. This method of defining AUC represents a shift from the existing works by Buros and Tubbs (2013), Hanley and McNeil (1982) and Mann and Whitney (1947).
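The construction of ROC points from a family of α-rules can be sketched in Python as follows. Since Equation 2.2 is not legible here, the rule used is an assumption chosen to agree with the statement that higher values of α give more predicted 1s: a subject is predicted positive when the predicted probability is at least 1 − α. The function name, the grid of 101 α values and the cell labels (a = true positives, b = false positives, c = false negatives, d = true negatives) are likewise illustrative.

```python
def alpha_rule_roc(observed, predicted_prob, n_rules=101):
    """ROC points (1 - specificity, sensitivity), one point per alpha-rule.

    Assumed rule: predict 1 when predicted_prob >= 1 - alpha, so that larger
    alpha yields more predicted 1s. Both outcome classes must be present.
    """
    points = []
    for k in range(n_rules):
        alpha = k / (n_rules - 1)  # alpha runs from 0 to 1 inclusive
        predicted = [1 if p >= 1 - alpha else 0 for p in predicted_prob]
        pairs = list(zip(observed, predicted))
        # Cells of the 2 x 2 table of observed against predicted outcomes.
        a = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
        b = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
        c = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
        d = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
        points.append((b / (b + d), a / (a + c)))
    return points
```

Each returned pair is one point of the empirical ROC curve; joining consecutive points and summing trapezoids gives the corresponding AUC.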

CALCULATING AUC BASED ON PREDICTED PROBABILITY OF POSITIVE RESPONSE
To model the AUC based on contingency Table 1, let $y^{1}_{ij}$ and $y^{0}_{ij}$ be the diagnostic test results of the ith subject at the jth diagnostic test, drawn randomly from the diseased and non-diseased populations respectively, while n represents the total number of sampled subjects responding positive (diseased). The AUC defined by equations 2.3 or 2.4 follows the non-parametric MW approach; it is rather flexible and yields ROC estimates with even better precision than the MW approach or the trapezoidal rule for calculating the AUC. This method of calculating AUC avoids the computationally complex procedures of maximum likelihood estimation (MLE) and numerical integration, which not only involve lengthy calculations but also make restrictive assumptions about the distribution of diagnostic test results. It is noteworthy that estimates from parametric methods such as MLE can be inconsistent when these assumptions fail, thereby giving a misleading picture of the regression relationship (Pepe, 2003). Our method of calculating AUC is unique in several ways: it incorporates the predicted probability of positive response in the construction of the ROC curve and hence the AUC; it uses the α prediction rule, which enables the construction of a very smooth ROC curve because many values of α are used; the AUC calculated is diagnostic test dependent, since the test result depends on the test applied to the subjects; and finally, the method is not only simpler and more straightforward but also avoids the iteration procedure, which is rigorous, time consuming and liable to erroneous results. The new AUCs, if obtained for two or more diagnostic tests, can be compared using a chi-square test statistic proposed for that purpose, which approximates a discrete distribution, as seen in a contingency table, by a continuous distribution.

APPLICATION TO REAL DATA
The proposed methods are applied to real data on gestational diabetes mellitus (GDM). This was a retrospective study of the test results of pregnant women screened using the 1-hour 50g Glucose Challenge Test (GCT) and diagnosed using the 75g OGTT as well as the 100g OGTT according to WHO (1999) and National Diabetes Data Group (NDDG, 1979) criteria. These test results were collected using the simple random sampling method. Medical records showed that a total of six thousand and ten (6010) pregnant women registered for the antenatal clinic (ANC) were universally screened for GDM with the 1-hour 50g GCT in the sampled hospitals within the two years chosen for this study (January 2011 to December 2012). Of these, a total sample of 1113 pregnant women who had positive risk factors (such as positive family history of diabetes, age at least 30 years, BMI ≥ 30 kg/m², previous fetal weight ≥ 4 kg, and positive obstetric history of GDM), were aged between 15 and 45 years, and were at less than 24 weeks or between 24 and 28 weeks of gestation tested positive for GDM (indicating the presence of GDM), since their plasma glucose level was at least 140 mg/dl after 1 hour. These positively responding women were subsequently recalled for a confirmatory diagnostic test using the 2-hour 75g OGTT in accordance with the criteria set by WHO (1985), which was later repeated using the 3-hour 100g OGTT during the later part of their gestation period.
The study protocol followed the recommendations for universal screening by the Fifth International Workshop Conference on Gestational Diabetes (Metzger et al, 2007). The essence of the repeated tests was to determine the GDM status of the women. Since there are two diagnostic results, the test results (2-hour 75g OGTT and 3-hour 100g OGTT) can be compared. Women who were known diabetics or who were suffering from any chronic illness were excluded from the study. After permission was obtained from the hospitals' Research and Ethics Committee, access was granted to the record units of the antenatal wards of these hospitals, where the medical histories of the patients were kept in a proforma containing general information on demographic characteristics such as body mass index, maternal age and previous fetal weight, and vital clinical histories such as obstetric history of GDM and family history of diabetes. BMI was calculated by dividing the weight in kilograms by the square of the height in meters.
The data for this paper are recorded according to the number of pregnant women, the GDM test results (response variable) and their observed risk factors or parent independent variables. This is suitable for fitting the classification models for analysis. Since we have two sets of response variables from the two diagnostic tests, we took the average of the test results as the observed outcomes. Analysis of these data yields the predicted probability of positive response needed for calculating the AUC for the models.

PREDICTED PROBABILITY OBTAINED FROM THE CLASSIFICATION MODELS
We here calculate the predicted probability of positive response from the classification models for the purpose of calculating the AUC. Fitting the data to the models yielded the results shown in Table 2. Having obtained the predicted probability of positive response for each of the models as in Table 2, we use each of them to find the predicted outcomes as defined in equation 2.2, together with the observed outcomes as defined in equation 2.1. Since the ROC curve evaluates the power of a continuous test result (observed outcome) to assign subjects correctly into a two-group classification, the observed outcomes are dichotomized using at least 7.8 mmol/l as the recommended cut-off value for GDM diagnosis. Having generated the observed outcomes based on equation 2.1 and the predicted outcomes based on equation 2.2, the resulting coded data are cross-classified and constitute as many tables of the form of contingency table 2.5 as are needed for constructing the ROC curve for each model. For instance, contingency Table 3 represents one pair of sensitivity and 1-specificity values out of the 1113 pairs required for constructing the ROC curve for each model.

Although each α-rule (for instance, Table 3) contributes one point to the ROC curve, the estimates of sensitivity and specificity obtained from a single table are not enough for a smooth ROC curve analysis for each model. In order to have sufficient estimated pairs of sensitivity and 1-specificity to generate a smooth curve, we vary the value of α from 0 to 1 in the prediction rule of equation 2.2, generating multiple prediction rules and hence enough contingency tables for the construction of the ROC curve. This computation was carried out with SAS version 9 software. The difference between one model and another is the predicted probability of positive response for that model. The number of pairs of sensitivity and 1-specificity equals the sample size, 1113, for each model. Using this method, the AUCs are obtained for the models.
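Once an AUC has been obtained from the predicted probabilities, its significance against the uninformative value 0.5 can be examined. The Python sketch below is one common way to do this and is not necessarily the procedure used for the p-values in this paper: it assumes the Mann-Whitney normal approximation, under which the estimate has null variance $(n_1 + n_0 + 1)/(12\,n_1 n_0)$ when there are no ties. The function name is hypothetical.

```python
from statistics import NormalDist


def auc_p_value(observed, predicted_prob):
    """AUC from predicted probabilities with a two-sided test of AUC = 0.5."""
    pos = [p for y, p in zip(observed, predicted_prob) if y == 1]
    neg = [p for y, p in zip(observed, predicted_prob) if y == 0]
    n1, n0 = len(pos), len(neg)
    # Mann-Whitney count of correctly ordered pairs; ties count one half.
    wins = sum(
        1.0 if p1 > p0 else 0.5 if p1 == p0 else 0.0 for p1 in pos for p0 in neg
    )
    auc = wins / (n1 * n0)
    # Standard error of the estimate under the null hypothesis (no tie correction).
    se0 = ((n1 + n0 + 1) / (12 * n1 * n0)) ** 0.5
    z = (auc - 0.5) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return auc, z, p_value
```

A small p-value suggests that the model discriminates better (or worse) than chance; with many tied predicted probabilities a tie-corrected variance would be more appropriate.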

ASSESSING THE DISCRIMINATING ABILITY OF CLASSIFICATION MODELS
Assessment of the classification models here is based on their discriminating or predictive ability, that is, their diagnostic accuracy. Since many models are available for assessment, interpreting and comparing the models using the ROC curves alone may be misleading; instead, the interpretation and comparison of discriminatory accuracy is based on the AUC, which summarizes the accuracy of each model. It is vital to note that these models were chosen because of their similar modeling techniques. The AUCs for the selected models are estimated using both parametric and non-parametric methods. Table 4 shows that smaller differences exist among the non-parametric estimates than among the parametric estimates. The largest difference in AUC is seen in linear discriminant analysis. This may be due to its strict reliance on the assumptions of normality and equal variance. Under the test of normality, if the chosen alpha level is 0.05 and the p-value is less than 0.05, then the null hypothesis that the data are normally distributed is rejected. If the p-value is greater than 0.05, then the null hypothesis is not rejected. Similarly, according to Krzyśko et al (2008), two test results such as X (diseased) and Y (non-diseased) may not follow normal distributions; nevertheless, the parametric binormal ROC curve, which assumes that X and Y each follow a normal distribution, generally gives good results because the ROC curve concerns the relationship between the two distributions rather than the individual distributions. Table 6 shows the p-values for the difference between two AUCs when the predicted probability method of calculating AUC is used. The null hypothesis of no difference in AUCs is rejected when the p-value is smaller than the level of significance (α = 0.05). Table 2.8 showed that the AUCs for linear discriminant analysis did not show any significant difference because the assumptions of normality and equal covariance are not met, while dummy variable regression analysis, followed by logistic regression analysis, showed significant differences in AUCs, indicating their flexibility in handling data when the assumptions of normality and equal variance are violated.

SUMMARY AND CONCLUSION
The purpose of this paper was to evaluate the methods of calculating AUC and its significance in assessing classification models. In assessing these classification techniques, it was found that non-parametric methods of estimating AUC give convergent results, while parametric approaches generally give higher values of AUC. In practice, the assumption of normality required by the parametric methods of estimating AUC is often not met. This is why the non-parametric methods are recommended. The strength of our method is that it is easy to implement for discriminating between diagnostic test procedures, even by non-statisticians. The proposed method offers reliable statistical inferences even in small-sample problems and circumvents the difficulties of deriving the statistical moments of complex summary statistics.

DISCUSSION
The method of calculating AUC from the predicted probability of positive response avoids the computationally complex procedures of maximum likelihood estimation (MLE) and numerical integration, which, being parametric methods, not only involve lengthy calculations but also make restrictive assumptions about the distribution of diagnostic test results. It is noteworthy that estimates from parametric methods such as MLE can be inconsistent when these assumptions fail, thereby giving a misleading picture of the regression relationship (Pepe, 2003). Our method of calculating AUC is unique in several ways: it incorporates the predicted probability of positive response in the construction of the ROC curve and hence the AUC; it uses the α prediction rule, which enables the construction of a very smooth ROC curve because many values of α are used; the AUC calculated is diagnostic test dependent, since the test result depends on the test applied to the subjects; and finally, the method is not only simpler and more straightforward but also avoids the iteration procedure, which is rigorous, time consuming and liable to erroneous results. The computational burden can still be substantial when the binormal ROC curve is used to calculate AUC because a number of iterative procedures are involved in obtaining estimators, for instance the MLE of AUC (Dorfmann & Alf, 1969; Metz et al, 1998).

RECOMMENDATIONS
The new method of calculating AUC from predicted probability is recommended for the reasons given above. It is highly recommended that a non-parametric method of calculating AUC, such as the predicted probability method, be employed because it is distribution free and does not require heavy computational procedures. It is also advised that dummy variable regression be used in classifying disease conditions since it not only discriminates well but also determines the contributions of the various levels of the parent independent variables.