Differential Item Functioning in English Language Test Using Item Response Theory for Ethnic Groups

The study investigated the detection of differential item functioning, using item response theory, in the West African Senior School Certificate English language test in South-South Nigeria. Four research questions and one hypothesis were formulated to guide the study. A descriptive survey research design was employed. The population of the study was 117,845 Senior Secondary 3 students in Edo, Delta, Rivers and Bayelsa states. A sample of 1,309 students (604 males, 705 females) drawn through a multi-stage sampling technique was used for the study. One validated instrument, the WASSCE/SSCE English Language Objective Test (ELOT), was used to collect data. The reliability of the instrument was estimated using Kuder-Richardson formula 20 (KR-20), yielding a coefficient of .84. The Lord Wald chi-square test statistic implemented in the Item Response Theory for Patient Reported Outcome (IRTPRO) software was the technique used in the data analysis that answered the research questions, and a chi-square test was used to test the hypothesis at the .05 level of significance. The results revealed that 20 items were flagged as exhibiting significant DIF between the Ijaw and Bini groups, 20 items between the Esan and Ijaw groups, 15 items between the Isoko and Ijaw groups, and 12 items between the Urhobo and Ijaw groups, and that the number of items that function differentially was significantly dependent on ethnic group. This shows a total of 95% based on ethnicity, indicating a large amount of DIF and items that are potentially biased. Based on the findings, recommendations were made, one of which was that item response theory should be used as a DIF detection method by large-scale public examination bodies and test developers.
Keywords: Differential item functioning, Item Response Theory, Ethnicity, English language, Examination
DOI: 10.7176/JESD/10-6-02
Publication date: March 31 2019

Introduction

Identification of biased items is of extreme necessity in the creation of equitable and content-valid tests or examinations. The fairness of test items to test takers is therefore of great value and importance. Examinations and tests are processes intricately tied to education, and education is one powerful instrument for the development of man and society. According to Enamiroro (2007), education has manpower development as its aim and is geared toward national growth and development. As such, it has been embraced and welcomed by most governments of the world, and the Nigerian government has not exempted itself from using this valuable tool in developing the nation. In pursuit of national growth and development, the Nigerian government has set out objectives to guide education. These objectives are stated in the Federal Republic of Nigeria (2004) National Policy on Education, and one of them is the creation of a just and egalitarian society as the foundation upon which education will operate. In pursuit of this, a national curriculum emerged that provides a platform on which the interests of children from different socio-cultural backgrounds, races, genders and ethnic groups are catered for, to bring about equal developmental opportunity. There also exist terminal examinations conducted by different bodies, such as the West African Examinations Council (WAEC), the National Examinations Council (NECO) and others, to which examinees of the same ability and trait but from different language, cultural, gender, racial, ethnic, geographical and socio-economic backgrounds are subjected.
These examinations are supposed to measure what they are designed to measure and should not be unfair to examinees from different groups. According to Roever (2005), a fair test is one that has comparable validity for all groups of individuals and that gives all test takers the same opportunity to demonstrate the skills and knowledge they have acquired and which are relevant to the test's purpose. If a test favours one group of test takers over another, the principle of test fairness may appear to have been violated, and this has great implications, as test or examination results play a great role in the milieu of high-stakes decision-making, including admission, aptitude, achievement, certification and so on. Items that make up a test are neutral, inconsequential tools, and they remain so until some sort of significance is assigned or attributed to the results derived from them. Once this significance is attached to a person's score, the individual will experience some repercussions, ranging from superficial to life-changing. These repercussions may be fair or unfair, helpful or disastrous, appropriate or misguided, depending on the significance attached to the test score (Gregory, 2004). Therefore, it is imperative and essential that the fairness principle is evident in a test and that the test is not biased against any group. However, in some instances, items in a test are biased because characteristics that are not part of the construct being measured form part of the test, and the fallout is that performance is affected (Abiam 1996; Nworgu & Odili 2005; Uwhekadom 2014). Bias, as seen by Brown (2012), happens when examinees of one group are more likely to answer an item correctly (or endorse an item) than examinees of another group because of some characteristic of the test item that is not material to the construct being measured.
Bias, as defined by Hambleton and Rodgers (1995), is the presence of some feature of an item that results in differential performance for persons of equal ability but from different ethnic, sex, cultural or religious groups. The issue of test bias has become the subject of enormous research, and a method called differential item functioning (DIF) has evolved to become a new standard in the scientific analysis of bias (Zumbo 1999). In the psychometric literature, the term DIF was created to frame concerns about item bias within the context of test bias (Lee 2015). The term describes the scientific or empirical evidence used to refute or back up claims of bias. Abedalaziz (2011) views DIF as a collection of statistical methods utilized to determine whether examination items are appropriate and fair for testing the knowledge of various subgroups of examinees. Carroll (2015) posited that when individuals from different groups (gender, majority/minority, SES, etc.) perform differently on a test item, this difference in the item score, above and beyond group differences on the construct, is referred to as DIF. Broadly defined, DIF refers to the presence of differences in individual item characteristics across groups, which can be viewed graphically as differently sloped or horizontally shifted item characteristic curves (ICCs) when the item parameters for each group are plotted (Lord, 1980; Thissen, Steinberg, & Wainer, 1988). These differences in item parameters are taken as empirical evidence of differences in an item's tendency to accurately estimate an examinee's standing on a latent trait (often ability) across groups. Thus, such differences may be sources of bias in the form of unfair advantages for some groups and unfair disadvantages for others (Carroll, 2015).
Differential item functioning (DIF) is a violation of the invariance assumption in item response theory (IRT) models and occurs when the probability of endorsing an item varies across groups for test takers of equal ability level (Battuaz, 2015). DIF refers to a difference in item functioning after groups have been matched with respect to ability (Wiberg 2007). DIF essentially deals with individuals from different subgroups, of equal ability or proficiency, having different probabilities of correctly answering an item (Anastasi & Urbina, 2007; Kamata & Vaughn 2004; Magis & Falcon 2011). It deals with differences in the functioning of items across groups with dissimilar cultural or experiential backgrounds (oftentimes demographic) which are matched on the underlying trait being measured by the items (Anastasi & Urbina, 1997; Camilli, 2006). DIF describes the empirical evidence used to support claims of bias. DIF is prima facie evidence that the test may be biased (Karami & Nodoushan 2016). As such, DIF helps in the identification of test items that are potentially biased (Perrone 2006). That an item is flagged as displaying DIF does not necessarily mean it is biased; alongside the statistical evidence of DIF, a sensitivity review of test content can lead to the identification of bias. In DIF analysis there are at least two groups, usually labelled the focal and reference groups, and DIF is generally examined by comparing item responses for these groups of examinees. In most applications, these groups represent types of examinees based on demographic characteristics such as gender or race: the focal group is the potentially disadvantaged group, and the reference group is the group assumed to be potentially advantaged by the test. However, naming the groups is not always clear-cut and can be arbitrary (Finch & French, 2008; Karami & Nodoushan 2011; McNamara & Roever, 2006).
Over the years, various methods have been applied in flagging DIF, and widely used among these are methods within the item response theory (IRT) approach. IRT detection methods, such as comparing estimated item parameters (the Lord Wald chi-square test), likelihood-ratio methods and area methods, are used to identify items that are potentially biased and provide, alongside p-values, ICC curves or trace lines depending on the parameter model (Finch & French, 2007; Kim & Cohen, 1995; Lee, 2015; Oshima & Morris, 2008; Woods, Cai, & Wang, 2013). IRT approaches posit a latent trait or ability that underlies the item responses and share the use of the estimate of the latent trait, rather than the observed score, as the matching variable. Ertuby and Russell (1999) suggested that, because of their greater sophistication, IRT procedures provide the best results for identifying items that are biased. IRT offers a robust approach to identifying DIF using the item characteristic curve (ICC). The ICC reveals, for each item, the differing probabilities of answering in a particular way contingent on the degree of the latent construct. ICCs are obtained from a function that specifies the relationship between an individual's location on a latent trait and the probability of the individual getting an item on a test correct (Siebert, 2013). Wiberg (2007) noted that if an item does not show DIF, the ICCs for the two groups will be identical, while DIF is present when the ICCs for the two groups differ.
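Under the two-parameter logistic (2PL) model used in such analyses, the ICC comparison described above can be sketched in a few lines. The item parameters below are purely hypothetical, chosen only to show how a difficulty parameter that differs between a focal and a reference group produces horizontally shifted ICCs:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a correct response
    for ability theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters estimated separately in each group.
focal = {"a": 1.2, "b": 0.5}       # e.g. focal-group calibration
reference = {"a": 1.2, "b": -0.2}  # e.g. reference-group calibration

# At the same ability level the probabilities differ: the ICCs are
# horizontally shifted, the graphical signature of uniform DIF.
for theta in (-1.0, 0.0, 1.0):
    p_f = icc_2pl(theta, **focal)
    p_r = icc_2pl(theta, **reference)
    print(f"theta={theta:+.1f}  P(focal)={p_f:.3f}  P(reference)={p_r:.3f}")
```

Plotting these probabilities over a range of ability values would reproduce the shifted curves described by Wiberg (2007); identical parameters in both groups would make the two curves coincide.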
DIF varies in magnitude and degree, which can be measured by examining parameters or statistics linked with the method used to detect DIF. Reported levels of DIF magnitude range over 0.25, 0.50, 0.75 and 1.00, with 1.00 as the largest, representing small to large DIF (Fidalgo, Ferreres & Muniz 2004; Hidalgo & Lopez-Pina 2004; Parshall & Miller 1995; Stephen-Bonty 2007). Empirical studies have shown that the percentage of DIF items ranges from quite small (1.5%) to overwhelmingly large (64%). Studies consider it a small amount when a test contains less than 10% DIF items, a medium amount when a test contains 10 to 30%, and a large amount when it exceeds 30%; when the percentage of DIF items exceeds 10%, closer attention should be paid to it (Hambleton & Rogers, 1989; Raju 1989; Xioting 2010).
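The percentage-based rule of thumb cited above can be expressed as a small helper function. The thresholds are those attributed to Hambleton & Rogers (1989), and the function is only an illustrative sketch:

```python
def classify_dif_amount(n_dif_items, n_total_items):
    """Classify the amount of DIF in a test using the <10% (small),
    10-30% (medium), >30% (large) rule of thumb cited above."""
    pct = 100.0 * n_dif_items / n_total_items
    if pct < 10.0:
        label = "small"
    elif pct <= 30.0:
        label = "medium"
    else:
        label = "large"
    return pct, label

# Example: 40 flagged items out of a 70-item test.
pct, label = classify_dif_amount(40, 70)
print(f"{pct:.1f}% of items show DIF: {label} amount")
```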
The analysis of DIF provides a convenient starting point for the study of item bias. These assertions reveal a general agreement that an item that functions differentially may not be biased, but a biased item must function differentially (Brown 2012; Cohen 2006; Hambleton & Rodgers 1995; Karami & Nodoushan 2011; Schumacker 2005; Perrone 2006; Williams 1997; Zumbo 1999; Wiberg, 2007).
Items can function differentially for any subject matter, including English language. English is a well-known language spoken by many; it occupies a primal function at the workplace, serves as a vehicle of instruction in schools and is the language of textbooks (Aina, Ogudele & Olanipekun, 2013; Adekola, Shoaga & Lawal 2015). The status assigned to English in Nigerian education is specified in the 2004 National Language Policy on Education, which states that English shall be the medium of instruction at the upper primary, secondary and tertiary levels of education (Adekola, Shoaga & Lawal 2015). Thus the status of English is enhanced, as it is not only a subject of study in school but also the language of instruction. The Government of Nigeria (2004) also stipulates in the National Policy on Education the significance of English language as one of the core subjects that enable an individual to gain admission into any higher institution, thereby making it a subject that holds the key to further academic progress. To have the chance to study any discipline in a university, polytechnic or college of education, an individual must have obtained a credit in English language in the WASSCE or its equivalent.
Although English is a foreign language in Nigeria, in the sense that it is not indigenous, it has been widely used as a second language (L2) ever since it was introduced, Nigeria being a multilingual country comprising diverse ethnic groups with diverse languages as their first languages (L1). This, however, has in no way diminished the significance of English language as earlier stated.
Despite all these roles and the immense significance of English language, there has been consistently poor performance in English over time in large-scale examinations such as those of WAEC, with 29.99% (2009), 23.36% (2010), 30.9% (2011), 38.81% (2012), 36.57% (2013), 31.28% (2014) and 38.68% (2015) of candidates obtaining five credits and above in English language and mathematics (Vanguard online 6/8/16; Daily Trust 21/8/2014). This shows that a high percentage of students enrolling for English language has obtained below credit in the past few years. As one of the examination subjects in WAEC, the English language paper is usually divided into three sections to which examinees respond. These cover the basic skills of reading, speaking, writing and listening, which Daladi, cited in Umar Sa'ad & Usman (2014), established as the basic skills that English language comprises. The essay and objective test formats are usually adopted by the West African Examinations Council in setting questions in English language.
The multiple-choice test item is a type of objective test item that requires testees to choose or select one response from a set of alternatives (Orluwene, 2012; Opara, 2016). Assessment via this format is employed not only by large-scale examination bodies like WAEC but by all schools in assessing testees who take English language as a subject of study, and in other subjects as well, since English is the language of instruction in all educational institutions irrespective of one's group membership. It is important that English language examinations, regardless of the body conducting them, have items that are fair, unbiased and not functioning differentially for any group defined by socio-economic status, gender, race, geographical location, ethnic group and so on.
Ethnicity as a concept involves some form of metaphor of kinship, especially the notion of common ancestry and blood relationship. It involves some form of identification: individuals identify themselves as belonging to a certain group, and the group recognizes individuals as belonging to it. Bottomley (1997) opined that cultural practices such as language and religion define the particularities of various ethnic groups. Thus each ethnic group has a native tongue, considered the first language (L1), while English (L2) is the language of learning and commerce. Various ethnic groups, numbering up to 350, exist within the Nigerian environment; some of them include the Ijaw, Yoruba, Isoko, Bini, Igbo, Esan, Urhobo, Efik and so on. Peculiarities in word pronunciation and word usage in some ethnic groups may cause variation in academic performance, especially in English language, a second language to which most students are exposed only after their mother tongue. Thus Fatemi & Khaghanzhad (2011) asserted that ethnic difference may cause great variation in test takers' performance on tests of English language in large-scale examinations where test takers take it as an L2. This variation can take the form of items functioning differently for some ethnic subgroups. An individual who grows up within a particular ethnic group will absorb the native tongue of that group, which may influence proficiency in English language.
Large-scale examinations like the WASSCE are taken by individuals from diverse ethnic groups, and research, though foreign, has revealed that the performance of test takers from different ethnic groups differs. It is therefore imperative that the performance of students of the same ability from different ethnic groups who take English language as a subject be examined.
It is imperative that items do not function differently for any group, because test results are widely seen as strong indicators of people's ability levels and performance in subjects (Whitmore and Schumaker, 1999).
However, different studies have shown that some tests used in public examinations in Nigeria and other large-scale examinations contain items that exhibit DIF to the advantage or disadvantage of different subgroups. For instance, in a study by Reuben & Akorede (2016), titled Differential Item Functioning Technique for Detecting Item Bias in Economics among Secondary School Students in Abuja Metropolis, using a sample of 750 drawn through multi-stage sampling, 3 items in the NECO (SSCE) 2013 economics objective test were identified that functioned differentially for groups from two different socio-economic statuses, five (5) items with regard to gender, and twenty-one (21) items with regard to school location.
Ogbebor & Onuka (2013), investigating differential item functioning as an item bias indicator in Delta State using a sample of 447 SS3 students, showed that the National Examinations Council (NECO) economics questions for 2010 had 18 items that functioned differentially for examinees based on school type and school location. Uwhekadom (2014) reported that the chemistry multiple-choice questions used by WAEC in the 2009, 2010 and 2011 SSCE contained test items that functioned significantly differentially for test takers of different genders, of high and low socio-economic status, and from urban and rural geographical locations.
Queeensoap (2014), in a study on the application of differential item functioning in detecting item bias in a chemistry achievement test in Nigeria, administered a chemistry achievement test to a sample of 400. Statistical and content analysis was done with logistic regression, the Mantel-Haenszel adjusted DIF statistic, the Scheuneman modified chi-square, and item characteristic curves. The modified Scheuneman chi-square showed 30 items exhibiting DIF, while the Mantel-Haenszel adjusted DIF statistic flagged all items as showing significant DIF between the focal group (Ijaw) and reference groups (Yoruba, Hausa & Igbo).
From the foregoing, the subject of items functioning differentially, and as such constituting a potential root of bias in high-stakes examinations, is of great concern. Furthermore, items functioning differentially for individuals of various subgroups at the same ability/trait level have great implications at the policy, administration and classroom levels, where test results constitute the grounds for decision-making. A test with differentially functioning items could bring about low achievement for a minority group in a subject, and this can distort the meaning of test results, and the decisions hinged on them, for some groups, especially in a core subject like English, a credit in which is a compulsory criterion for further educational advancement. This is a concern, as public dismay most often greets the release of Senior School Certificate Examination (SSCE) results: consistently poor performance in English language examinations has been observed over time, with 29.99% (2009), 23.36% (2010), 30.9% (2011), 38.81% (2012), 36.57% (2013), 31.28% (2014) and 38.68% (2015) of examinees from different subgroups obtaining five credits and above including English language in the WASSCE in recent years. The problem of this study, as seen by the researchers, is: are there items in the English language multiple-choice test used by WAEC in the Senior School Certificate Examination that function differentially for candidates of equal ability from different ethnic groups, such that they contribute to poor performance in the subject?
The following research questions guided the study:
1. Which items in the English language multiple-choice test function significantly differentially between the focal group (Ijaw) and the reference group (Bini)?
2. Which items in the English language multiple-choice test function significantly differentially between the focal group (Ijaw) and the reference group (Esan)?
3. Which items in the English language multiple-choice test function significantly differentially between the focal group (Ijaw) and the reference group (Isoko)?
4. Which items in the English language multiple-choice test function significantly differentially between the focal group (Ijaw) and the reference group (Urhobo)?

The following null hypothesis was tested at the 0.05 level of significance:
1. The number of items that function differentially in the English language multiple-choice test is not significantly dependent on ethnic group.

Method
The descriptive survey research design was used for this study. The population consisted of one hundred and seventeen thousand, eight hundred and forty-five (117,845) Senior Secondary 3 students in 1,190 public secondary schools studying English language as a certificate subject in the 2016/2017 academic session in Delta, Bayelsa, Rivers and Edo states. The sample was 1,309 students drawn from this population. A multi-stage sampling technique was employed, with several sampling techniques, such as simple random, cluster and stratified sampling, used at the different stages. The instrument for the study has two sections. The first section collects demographic information on the respondents, such as sex, state, ethnicity and name of school. The second part is the English Language Objective Test (ELOT), which was based on the WASSCE/SSCE English language Paper 1 used in the 2016 examination. It contains 70 multiple-choice questions constructed by subject experts and developed by WAEC into test form. This instrument was employed to detect items that function differentially. The face and content validity of the 2016 WAEC/SSCE English language paper has been established, the questions being owned by the examining body and validated by its experts through statistical techniques. The reliability coefficient of the ELOT was established using the Kuder-Richardson formula 20 (KR-20), which yielded an internal consistency coefficient of 0.84. The instruments were administered directly to the respondents on an individual basis in their classes and were retrieved on the spot.
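As an illustration of the reliability estimate reported above, KR-20 can be computed from a matrix of dichotomous item scores. The response matrix below is made up solely for demonstration and is not the study's data:

```python
def kr20(score_matrix):
    """Kuder-Richardson formula 20 (KR-20) reliability for a test of
    dichotomously scored (0/1) items.

    score_matrix: one row per examinee, one 0/1 entry per item."""
    n_items = len(score_matrix[0])
    n_examinees = len(score_matrix)
    # Sum of p*q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in score_matrix) / n_examinees
        pq_sum += p * (1.0 - p)
    # Variance of the total scores (population form, as KR-20 uses).
    totals = [sum(row) for row in score_matrix]
    mean = sum(totals) / n_examinees
    variance = sum((t - mean) ** 2 for t in totals) / n_examinees
    return (n_items / (n_items - 1.0)) * (1.0 - pq_sum / variance)

# Tiny hypothetical response matrix (4 examinees x 3 items).
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(f"KR-20 = {kr20(responses):.2f}")  # prints KR-20 = 0.75
```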
After administration and retrieval, only 1,309 of the 1,400 instruments administered to the initial sample of 1,400 were properly filled; these were used for the analysis, making the final sample size 1,309. Every statistical model requires assumptions about the data in order to obtain viable parameter estimates. In checking the assumptions, IRTPRO first conducts a unidimensional IRT analysis to check the unidimensionality assumption of IRT using confirmatory factor analysis. Once this assumption is met, the local independence assumption holds as well, since data that meet the unidimensionality assumption also meet the local independence assumption.
For the IRT assessment of model-data fit and item calibration, the 2PL model was used, as indicated by the chi-square likelihood-ratio goodness-of-fit statistic. To equate the scales, the Item Response Theory for Patient Reported Outcome (IRTPRO) software treated the Ijaw as the focal group and the Urhobo as the reference group in the first equating; in the second equating, the Ijaw was the focal group and the Isoko the reference group; in the third, the Ijaw was the focal group and the Bini the reference group; and in the fourth, the Ijaw was the focal group and the Esan the reference group. For item calibration (estimation of item parameters and examinee ability), the Lord Wald chi-square test, as implemented in the IRTPRO software developed by Cai, Thissen & du Toit (2011), was employed to detect items that function differentially between the focal and reference groups, as indicated by the p-values (tested at the 0.05 significance level) and χ² values for the reference and focal groups.
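The Lord Wald test applied above can be sketched as a quadratic form comparing the (a, b) estimates of a single item across the two groups, with df = 2 for the 2PL. The parameter estimates and covariance matrices below are hypothetical, since the actual values come from IRTPRO's calibration:

```python
def wald_chi2_2pl(params_focal, cov_focal, params_ref, cov_ref):
    """Lord's Wald chi-square for one 2PL item: compares the (a, b)
    estimates from the focal and reference groups (df = 2).
    Covariances are 2x2 matrices [[var_a, cov_ab], [cov_ab, var_b]]."""
    d = [params_focal[0] - params_ref[0], params_focal[1] - params_ref[1]]
    # Pooled covariance of the parameter difference.
    s = [[cov_focal[i][j] + cov_ref[i][j] for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    # Quadratic form d' S^{-1} d.
    return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))

# Hypothetical estimates for one item; compare against the chi-square
# critical value of 5.99 at df = 2, alpha = .05 (as used in the study).
chi2 = wald_chi2_2pl((1.4, 0.6), [[0.02, 0.0], [0.0, 0.03]],
                     (1.1, 0.1), [[0.02, 0.0], [0.0, 0.03]])
print(f"Wald chi-square = {chi2:.2f}, flag DIF: {chi2 > 5.99}")
```

IRTPRO computes the parameter estimates and their covariances internally during calibration; this sketch only shows how the flagging decision follows from them.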
The null hypothesis tested was that the number of items that function differentially in the English language test is not significantly dependent on ethnic group.
To test the null hypothesis, the data were subjected to a chi-square test, and the result is presented in Table 5. Table 5 shows that the expected counts for all the ethnic groups are greater than 5. It further shows that the residuals between the expected and observed counts are the same for all the ethnic groups. The chi-square value of χ² = 264.22 (df = 3), p = .000, is less than 0.05 and is therefore statistically significant at the chosen alpha level of 0.05 for all the ethnic groups. Therefore, the number of items that function differentially in the English language test is significantly dependent on ethnic group, as p < 0.05. The null hypothesis that the number of items that function differentially in the English language test is not significantly dependent on ethnic group is rejected, and the alternative is accepted.
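The mechanics of the hypothesis test can be sketched as a one-way chi-square goodness-of-fit on the per-comparison DIF-item counts against equal expected counts. The counts below are hypothetical, used only to illustrate the computation behind a table like Table 5:

```python
def chi2_gof(observed):
    """One-way chi-square goodness-of-fit against equal expected counts,
    as used to test whether the number of DIF items depends on group."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical DIF-item counts for four focal/reference comparisons.
counts = [40, 25, 10, 5]
chi2 = chi2_gof(counts)
df = len(counts) - 1
# Critical value at df = 3, alpha = .05 is 7.815.
print(f"chi2 = {chi2:.3f} on df = {df}; significant: {chi2 > 7.815}")
```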

Discussion of Findings
Findings from this study reveal that, of the 70 items, based on ethnicity, that is between the Ijaw focal group and the Bini, Esan, Isoko and Urhobo reference groups, 18 items functioned significantly differentially between the Ijaw focal group and the Bini reference group, representing 25.7%; 17 items functioned significantly differentially between the Ijaw focal group and the Esan reference group, representing 24.3%; 13 items exhibited significant DIF between the Ijaw focal group and the Isoko reference group, representing 18.6%; and 12 items were identified as significantly exhibiting DIF between the Ijaw focal group and the Urhobo reference group, representing 17.1%, as seen from their Wald chi-square values, which were greater than the critical value of 5.99 at df = 2, and from their p-values, which were all significant at .05 (p < .05).
On the whole, 40 of the 70 items (57%) were flagged as showing DIF based on ethnicity. The implication is that the English Language achievement test used in the 2016 WASSCE contains items that function differentially to a significant degree. Empirical studies have revealed that the percentage of DIF items in a test can range from quite small (1.5%) to overwhelmingly large (64%).
Studies consider it a small amount of DIF when a test contains fewer than 10% DIF items, a medium amount when it contains 10 to 30%, and a large amount when it exceeds 30%; once the percentage of DIF items exceeds 10%, closer attention should be paid to the test (Hambleton & Rogers, 1989; Raju, 1989; Xioting, 2010). That 57% of the items were flagged as showing DIF therefore indicates a large amount or magnitude of DIF.
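The rule-of-thumb bands just cited reduce to a simple classification of the DIF percentage. The helper below is an illustrative sketch of that rule (the function name is ours, not from the cited sources):

```python
def dif_magnitude(n_dif, n_items):
    # bands from the DIF literature cited above:
    # < 10% small, 10-30% medium, > 30% large
    pct = 100.0 * n_dif / n_items
    if pct < 10:
        return "small"
    if pct <= 30:
        return "medium"
    return "large"

# 40 of 70 items flagged (about 57%) falls in the "large" band
print(dif_magnitude(40, 70))
```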
An item that functions differentially may not be biased, but a biased item must function differentially. This indicates that some of the items are potentially biased, since all biased items must function differentially and DIF is the empirical evidence used to refute or support a claim of bias (Brown, 2012; Cohen, 2006; Hambleton & Rodgers, 1995; Karami & Nedoushan, 2011; Schumacker, 2005; Perrone, 2006; Williams, 1997; Zumbo, 1999). A probable reason why some items function differently for the groups is disparity in the two groups of examinees' exposure to the content or vocabulary contained in the items. Consistent with this is the study by Queensoap (2014), whose results showed 30 items exhibiting DIF between the focal group (Ijaw) and the reference groups (Yoruba, Hausa and Igbo): 1% showing DIF for the Ijaw focal group and Yoruba reference group, 17% for the Ijaw focal group and Igbo reference group, and 7% for the Ijaw focal group and Hausa reference group. Similar is the investigation by Hambleton and Rogers (1989), in which some items in the mathematical and verbal components of the SAT functioned differently against white test takers from rural settlements. Also consistent with the findings of the present work are those of Yang and Jones (2007), whose results indicated measurement bias attributable to race that was significant for two CES-D items.
Engelhard et al. (2013), however, found that, overall, no item subsets appeared to function in an unexpected way across the subgroups of persons examined (gender, race, ethnicity and best-language subgroups). This divergent finding may be attributed to the fact that the present study is local while that of Engelhard et al. (2013) is foreign; another reason could be that the latter used the Mantel-Haenszel technique, an observed-score method of detecting DIF, while the current study employed an IRT-based method, using the Wald test as implemented in IRTPRO.

Recommendations
Based on the findings of the study, the following recommendations were made:
1. Items identified as showing DIF in a large percentage in large-scale or public examinations should be further investigated through qualitative content analysis by subject-matter experts. On such investigation, DIF items can be edited or eliminated from the test or item bank if ascertained to exhibit bias. Test developers and large-scale examination bodies like WAEC should employ the IRT framework, especially specialized IRT software such as IRTPRO, in detecting items that function differentially for testees, so that test items are valid, reliable and usable, thereby reducing, if not eliminating, bias that may exist in test items for examinees from different groups. WAEC and other public examination bodies should also analyze their items for DIF before building them into their test banks.

2. Psychometricians, government, private firms and other stakeholders should employ DIF analysis in detecting biased items. For certification and admission into higher institutions, government should make laws that encourage the fair use of test scores, as this will help protect the interests of examinees of matched ability from diverse subgroups such as ethnicity, socio-economic status, gender and location. Test developers and examining bodies working within only the CTT measurement framework should incorporate the IRT framework into their test development practice.