Test Length and Sample Size for Item-Difficulty Parameter Estimation in Item Response Theory

The study investigated the test lengths and sample sizes needed for accurate and stable estimation of the item-difficulty parameter in the Item Response Theory (IRT) One-Parameter Logistic Model (1PLM). Real data from students who sat for the June/July 2015 Economics Multiple-Choice Examination in Edo State were obtained from the National Examinations Council (NECO), Nigeria. The statistical population comprised 5,158 examinees and the test contained 60 items. Sample sizes of 200, 500, 1000, 2000 and 5000 were randomly drawn from the population with replacement; each sample was paired with test lengths of 10, 20, 30 and 50, giving 20 statistical conditions (5 sample sizes × 4 test lengths). The parameter estimates were generated using the eirt Item Response Theory Assistant for Excel. The item-difficulty parameters estimated from the entire population were assumed to be the true parameter values against which the others were compared, using the Root Mean Square Error (RMSE) as the evaluative criterion. The acceptable RMSE was ≤ 0.33. The conclusion reached was that an accurate item-difficulty parameter estimate in the 1PLM requires a test length of at least 10 and a sample size of at least 1000.


Introduction
Item Response Theory (IRT), also known as Modern Test Theory (MTT), belongs to the class of Latent Trait Theory (LTT); it is a psychometric framework for item analysis and test development. It takes both item quality and examinees' abilities into consideration when evaluating the psychometric properties (difficulty, discrimination and guessing parameters) of items in a scale.
IRT item-parameter estimation entails complex mathematical computation, although computer software has reduced the rigour associated with it. Different software packages use different estimation techniques, but the issue noted by psychometric researchers is what constitutes an adequate number of items (test length) and sample size for accurate parameter estimation. Hambleton (1989) asserted that the test length and sample size needed for accurate IRT item-parameter estimation are difficult to determine.
In the IRT measurement framework there are at present three popular models for the dichotomous response category: the One-Parameter Logistic Model (1PLM), the Two-Parameter Logistic Model (2PLM) and the Three-Parameter Logistic Model (3PLM), depending on the number of parameters (discrimination, difficulty and guessing) that are of interest. In some research situations the interest may be only in determining the difficulty level of items, without regard for discrimination and guessing, as in the "Rasch tradition". Stone (2003) reported that sample size is a major factor in obtaining stable parameter estimates when the Rasch model/1PLM is fitted to a data set. However, large numbers of examinees are sometimes not available, especially in small-scale testing such as the administration of teacher-made tests. This condition should not prevent psychometricians from benefiting from the gains of IRT. What, therefore, should be the minimum test length and sample size, from an empirical point of view using real test data, for accurate and stable difficulty-parameter estimation?
In this study the researchers empirically examined different test lengths against varying sample sizes in the estimation of difficulty parameters under the 1PLM.
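For reference, the 1PLM described above models the probability of a correct response as a logistic function of the difference between examinee ability and item difficulty. The following is a minimal sketch of that response function; the names `p_correct`, `theta` and `b` are illustrative, and the sketch assumes a discrimination fixed at 1 with no scaling constant, which may differ from the parameterisation used by a given estimation package.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """Probability of a correct response under the 1PLM (Rasch form).

    theta: examinee ability; b: item difficulty (both on the logit scale).
    Assumption of this sketch: discrimination is fixed at 1.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# An examinee whose ability equals the item difficulty has a 50% chance.
print(p_correct(0.0, 0.0))  # 0.5
# A more able examinee has a higher chance on the same item.
print(p_correct(1.0, 0.0) > p_correct(0.0, 0.0))  # True
```

Because difficulty is the only item parameter in this model, the accuracy of its estimate is the only item-level quantity at stake in the comparisons that follow.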
Many psychometric researchers have published work on the effect of sample size and test length on the psychometric properties (discrimination, difficulty and guessing parameters) of items, as well as on the ability parameter of examinees. For example, Stone (2003) studied the effect of sample size on the accuracy of the item-difficulty parameter. In that study, sample sizes ranging from 10 to 3,000 were taken from an examinee population of 3,173. The samples were randomly selected into estimation conditions, and the WINSTEPS software was used to estimate the item-difficulty parameters. The estimated item-difficulty parameters attained acceptable values and began to converge only when the sample size reached 500.
Akour and AL-Omari (2013) conducted a similar study in Jordan, using sample sizes of 200, 500, 1000, 5000, 10000 and 20000 against test lengths of 15, 30 and 60. The data were operational Mathematics data from a test conducted by the Ministry of Education in Jordan; about 40,000 testees took the test. The 3PL model was fitted, and they concluded that a test length greater than 15 and a sample size greater than 500 are needed for an accurate and stable item-difficulty parameter estimate.
Custer (2015) examined item-difficulty parameter estimates with 40 items across various sample sizes. In the study, 3,000 examinees were simulated with normally distributed ability levels. The item-difficulty parameters estimated using all 3,000 examinees served as the true item parameters, while sample sizes of 100, 200, 300,... 1000 were randomly selected to make up 10 replications. The WINGEN IRT software was used to estimate the item-difficulty parameters, and he reported a sample size of 500 as the minimum requirement for a stable parameter estimate.
Journal of Education and Practice www.iiste.org ISSN 2222-1735 (Paper) ISSN 2222-288X (Online) Vol.10, No.30, 2019
Sahin and Anıl (2017) considered various sample sizes (150, 250, 350, 500, 750, 1000, 2000, 3000 and 5000) against different test lengths (10, 20 and 30). They concluded that both sample size and test length are important factors in IRT item-parameter estimation, and that sample sizes of 250, 350, 500 and 750 examinees can be used depending on test length. They presented a trade-off between sample size and test length, and concluded that a sample size of 150 is adequate for estimating the difficulty parameter in the 1PLM irrespective of test length (10, 20 or 30).
He and Wheadon (2017) investigated the effect of sample size on parameter estimates using the partial credit model. Their study revealed a trade-off between sample size and the accuracy of parameter estimation, and they likewise concluded that the accuracy of item-parameter estimation is a function of sample size.

Research Question:
What test length and sample size are acceptable for estimating the item-difficulty parameter in the IRT 1PLM?

Methodology:
The study adopted the Survey Research Design. The population comprised the twenty-three thousand two hundred and fifteen (23,215) examinees who sat for the National Examinations Council (NECO) Senior School Certificate Examination (SSCE) Economics objective test, Paper III, conducted in June/July 2015 in Edo State, Nigeria. The statistical sample was five thousand one hundred and fifty-eight (5,158) examinees, namely those who responded to the "Type C" option among the four available types A, B, C and D.
Responses by examinees within the sample were randomly assigned with replacement to groups of 200, 500, 1000, 2000 and 5000; hence there were five sample sizes. The instrument contained sixty (60) items; for the purpose of the investigation the items were randomly selected with replacement into groups of 10, 20, 30 and 50. Each sample size was paired with each test length, giving twenty (20) statistical conditions (five sample sizes × four test lengths), as shown in Table 1 below. Estimates obtained from the full statistical sample and the full test length were assumed to be the true parameter values. According to Swaminathan, Hambleton, Sireci, Xing, & Rizavi (2003), estimating the item parameters using the entire population of examinees produces the true item parameters. These values were compared with those obtained in the other combinations. All parameter estimations were done using the eirt Item Response Theory Assistant for Excel by Germain, Valois, & Abdous (2007).
The differences between the true parameter values and the values obtained from the sub-samples were taken as the effect of sample size and test length. The Root Mean Square Error (RMSE) statistic was adopted in making this comparison. The computational formula for the RMSE is:
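The sampling design described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the response matrix is random placeholder data, the function name `draw_condition` and the seed are hypothetical, and the sketch assumes that item columns are drawn without repetition within a condition so that no item appears twice in a sub-test.

```python
from itertools import product

import numpy as np

SAMPLE_SIZES = (200, 500, 1000, 2000, 5000)
TEST_LENGTHS = (10, 20, 30, 50)


def draw_condition(responses, n_examinees, n_items, rng):
    """Return the response sub-matrix for one condition and its item columns."""
    # Examinees (rows) are drawn with replacement, as in the study design.
    rows = rng.choice(responses.shape[0], size=n_examinees, replace=True)
    # Assumption of this sketch: items (columns) are drawn without repetition.
    cols = np.sort(rng.choice(responses.shape[1], size=n_items, replace=False))
    return responses[np.ix_(rows, cols)], cols


rng = np.random.default_rng(2015)  # arbitrary seed for reproducibility
# Placeholder data only: random 0/1 scores standing in for the
# 5,158 x 60 matrix of scored responses.
responses = rng.integers(0, 2, size=(5158, 60))

# One sub-matrix per (sample size, test length) pair.
conditions = {
    (n, k): draw_condition(responses, n, k, rng)
    for n, k in product(SAMPLE_SIZES, TEST_LENGTHS)
}
print(len(conditions))  # 20
```

Each sub-matrix would then be passed to the estimation software, and the resulting difficulty estimates compared with those obtained from the full matrix for the same items.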

\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{k} (\delta_i - T_i)^2}{k}} \]
Where: δi is the estimated item parameter, Ti is the "true" item parameter, and k is the test length. The smaller the RMSE, the closer the estimated parameter values are to the true parameter values and the better the estimates. To determine the feasible sample size and test length, Rudner (1998) opined that RMSE ≤ 0.33, which corresponds to a classical reliability value of 0.90, be taken as the criterion for the minimum feasible sample size for a particular test length and IRT model. Han, Kolen, & Pohlmann (1997), however, considered an RMSE less than 0.6 to be small and equally acceptable. The RMSE ≤ 0.33 criterion was adopted in evaluating the results of this study because it is the more stringent of the two: any estimate that satisfies it also satisfies the 0.6 criterion.

Table 2 contains the results obtained under the various statistical conditions of test length and sample size. A sample size of 200 did not yield an acceptable RMSE when combined with test lengths of 10, 20, 30 and 50 (RMSEs > 0.33). In the same vein, a sample size of 500 did not yield an acceptable RMSE when combined with test lengths of 10, 20, 30 and 50 (RMSEs > 0.33). However, a sample size of 1000 yielded RMSEs less than 0.33.

Figure 1 shows the graphical representation of the results contained in Table 2. The lines representing the test lengths all rise above an RMSE of 0.33 on the y-axis at sample sizes of 200 and 500, but fall below 0.33 at sample sizes of 1000, 2000 and 5000.
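The RMSE comparison described in the Methodology can be sketched in a few lines. The difficulty values below are hypothetical and serve only to illustrate the arithmetic; they are not taken from the study's tables, and the function names are illustrative.

```python
import numpy as np

RMSE_CRITERION = 0.33  # criterion attributed to Rudner (1998) in the text


def rmse(estimated, true):
    """Root mean square error between estimated and 'true' item difficulties."""
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    if estimated.shape != true.shape:
        raise ValueError("estimated and true must have the same length")
    # Mean of squared differences over the k items, then the square root.
    return float(np.sqrt(np.mean((estimated - true) ** 2)))


def is_acceptable(value, criterion=RMSE_CRITERION):
    """True when the RMSE meets the adopted criterion (RMSE <= 0.33)."""
    return value <= criterion


# Hypothetical difficulties for a three-item example (not study data).
true_b = [0.4, 0.0, 1.3]
est_b = [0.5, -0.2, 1.0]

value = rmse(est_b, true_b)
print(round(value, 3), is_acceptable(value))  # 0.216 True
```

In the study's design, one such value would be computed for each of the 20 conditions and compared with the criterion.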

Discussion of Findings
The results presented above show that, for acceptable, stable and accurate estimation of the item-difficulty parameter under the 1PLM, a sample size of 1000 and a test length as low as 10 can suffice. This finding contradicts that of Sahin and Anıl (2017), who concluded that a sample size of 150 is adequate in combination with a test length of 10 and above. However, there appears to be agreement with respect to test length and to the trade-off between test length and sample size.
The findings agree with those of Stone (2003), who reported convergence, or stable parameter estimates, only after the sample size reached 500. The findings also corroborate those of Custer (2015), who concluded that a sample size of 500 in combination with a test length of 40 is needed for accurate item-difficulty parameter estimation. Although Custer found a sample of 500 to be sufficient, that conclusion is consistent with the existence of a trade-off between test length and sample size: when a shorter test is used, a larger sample is needed in order to obtain an accurate estimate.
The implication of the findings is that, for the IRT measurement framework to be used in test development, there should be at least 1000 examinees and at least 10 items. Having 1000 examinees is a far cry from many operational situations, where the examinees may not be that many; the IRT procedure is therefore still limited to a large extent in its scope of applicability in test development.

Conclusion
From the evidence gathered in the study, the researchers concluded that when the focus of estimation is on the item-difficulty parameter alone, as in the 1PLM, and a high level of accuracy is desired, the sample size should be at least 1000 and the test length at least 10. However, a sample size of 500 with a test length of 10 can still yield an acceptable item-difficulty parameter estimate, since the RMSE at these points still falls below 0.6, a criterion provided by Han, Kolen, & Pohlmann (1997).

Recommendations
Arising from the findings of the study, the researchers recommend that, when the 1PLM is the model of choice within the IRT framework, a test length of at least 10 and a sample size of at least 1000 should be used in order to obtain a highly accurate estimate of the difficulty parameter.