With the proliferation of computer and Internet technologies, Computer Aided Assessment (CAA) tools have become a major trend in academic institutions worldwide. Through these systems, tests composed of various question types can be presented to students in order to assess their knowledge. Yet there has been considerable criticism of test quality, with both research and experience showing that many test items (questions) are flawed in some way at the initial stage of their development. Test developers can expect that about 50% of their items will fail to perform as intended, which may eventually lead to unreliable measures of examinee performance. It is therefore imperative to ensure that individual test items are of the highest possible quality, since a single poor item could have an inordinately large effect on some scores.
There are two major approaches to item evaluation using item response data, and both can be used, sample size permitting. The classical approach focuses on traditional item indices borrowed from Classical Test Theory (CTT) such as item difficulty, item discrimination, and the distribution of examinee responses across the alternative responses. The second approach uses Item Response Theory (IRT) to estimate the parameters of an item characteristic curve which provides the probability that an item will be answered correctly based on the examinee's ability level as measured by the test.
The natural scale for item difficulty in CTT is the percentage of examinees correctly answering the item. A common term for item difficulty is the p-value, which stands for the proportion or percentage of examinees correctly answering the item. Every item has a natural difficulty based on the performance of all persons undertaking the test; however, this p-value is quite difficult to estimate accurately unless a very representative group of test-takers is being tested. If, for example, the sample contains well-instructed, highly able or highly trained people, then the test and its items will appear very easy. On the other hand, if the sample contains uninstructed, low-ability or untrained people, then the same test will appear very hard. This is one of the main reasons CTT is often criticized: the estimate of the p-value is potentially biased by the sample on which it is based.
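The sample dependence of the p-value can be illustrated with a minimal sketch. The function name and both response matrices below are hypothetical and serve only to show the same three items appearing easier or harder depending on who answers them.

```python
from typing import List


def item_p_values(responses: List[List[int]]) -> List[float]:
    """Return the proportion of examinees answering each item correctly.

    `responses` holds one row per examinee and one 0/1 entry per item.
    """
    if not responses:
        raise ValueError("at least one examinee is required")
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[j] for row in responses) / n_examinees
        for j in range(n_items)
    ]


# Hypothetical samples answering the same three items.
strong_sample = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
weak_sample = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0]]

p_strong = item_p_values(strong_sample)  # items look easy for this group
p_weak = item_p_values(weak_sample)      # the same items look hard here
```

Every item receives a higher p-value from the stronger group, although the items themselves are unchanged.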
With IRT the composition of the sample is generally immaterial, and item difficulty can be estimated without bias. The one-, two-, and three-parameter binary-scoring (dichotomous) IRT models typically lead to similar estimates of difficulty, and these estimates are highly correlated with classical estimates of difficulty. Additionally, while classical statistics are relatively simple to compute and understand and do not require sample sizes as large as those required by IRT statistics, they a) are less sensitive to items that discriminate differentially across different levels of ability (or achievement), b) do not work as well when different examinees take different sets of items, and c) are not as effective in identifying items that are statistically biased. As a result, the use of IRT models has spread rapidly during the last 20 years, and they are now used in the majority of large-scale educational testing programs involving 500 or more test-takers.
IRT analysis yields three estimated parameters for each item: a, b and c. The a parameter is a measure of the discriminating power of the item, the b parameter is an index of item difficulty, and the c parameter is the "guessing" parameter, defined as the probability of a very low-ability test taker getting the item correct. A satisfactory pool of items for testing is one characterized by items with high discrimination (a > 1), a rectangular distribution of difficulty (b), and low guessing (c < 0.2) parameters. The information provided by the item analysis assists not only in evaluating performance but in improving item quality as well. Test developers can use these results to determine whether an item can be reused as is, should be revised before reuse or should be taken out of the active item pool. What makes an item's performance acceptable should be defined in the test specifications within the context of the test purpose and use.
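To make the roles of a, b and c concrete, the following sketch evaluates an item characteristic curve. It assumes the common logistic form of the three-parameter model without a scaling constant; the text above does not state which parameterization is used, so the formula and the function name are assumptions.

```python
import math


def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct answer under a three-parameter logistic model.

    Assumed form: P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b))).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: a = 1.2, b = 0.0, c = 0.2.
at_difficulty = p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)   # equals (1 + c) / 2 when theta == b
very_low_ability = p_correct_3pl(theta=-50.0, a=1.2, b=0.0, c=0.2)  # approaches c
high_ability = p_correct_3pl(theta=3.0, a=1.2, b=0.0, c=0.2)     # approaches 1
```

Under this form, c is the lower asymptote of the curve, b is the ability at which the probability lies halfway between c and 1, and a controls how steeply the curve rises around b.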
Unfortunately, only a few test developers have the statistical background needed to fully understand and utilize the IRT analysis results. Although it is almost impossible to compel them to further their studies, it is possible to provide them with some feedback regarding the quality of the test items. This feedback can then act as a guide to discard defective items or to modify them in order to improve their quality for future use. Based on that notion, the present paper introduces a comprehensible way to present IRT analysis results to test developers without delving into unnecessary details. Instead of memorizing numerous commands and scenarios from technical manuals, test developers can easily detect problematic questions from the familiar user interface of a Learning Management System (LMS). The latter can automatically calculate the limits and rules for the a, b, and c parameters based on the percentage of questions wanted for revision. The examinee's proficiency (θ) is represented on the usual scale (or metric) with values ranging roughly between -3 and 3, but since these scores include negative ability estimates, which would undoubtedly confuse many users, they can optionally be normalized to a scale score ranging from 0 to 100.
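Both ideas can be sketched in a few lines. The text does not specify how the scale score or the automatic limits are computed, so the linear mapping with clamping and the rank-based threshold below are assumptions, and all names are hypothetical.

```python
from typing import Sequence


def theta_to_scale_score(theta: float, lo: float = -3.0, hi: float = 3.0) -> float:
    """Map theta linearly from [lo, hi] to [0, 100], clamping values outside."""
    clamped = max(lo, min(hi, theta))
    return 100.0 * (clamped - lo) / (hi - lo)


def discrimination_limit(a_values: Sequence[float], revise_fraction: float) -> float:
    """Return a lower limit on a that flags roughly `revise_fraction` of items.

    Items whose a value is strictly below the returned limit would be flagged.
    """
    if not 0.0 <= revise_fraction <= 1.0:
        raise ValueError("revise_fraction must be between 0 and 1")
    ranked = sorted(a_values)
    n_flagged = int(len(ranked) * revise_fraction)  # rounds down
    if n_flagged >= len(ranked):
        return float("inf")  # every item is flagged
    return ranked[n_flagged]


# Hypothetical discrimination estimates for ten items.
a_estimates = [0.3, 1.1, 0.8, 1.5, 0.4, 0.9, 1.2, 0.7, 1.0, 1.3]
limit = discrimination_limit(a_estimates, revise_fraction=0.2)
flagged = [a for a in a_estimates if a < limit]
```

Analogous thresholds for b and c could be derived in the same way from the upper or lower tail of each parameter's estimates.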
The use of Learning Management Systems (LMSs) and CAA tools has increased greatly due to students' demand for more flexible learning options. However, only a small fraction of these systems support an assessment quality control process based on the interpretation of item statistic parameters. Popular e-learning platforms such as Blackboard, Moodle and Questionmark have plug-ins or separate modules that provide statistics for test items, but apart from that they offer no suggestions to test developers on how to improve the problematic items. Therefore, many researchers have recently endeavored to provide mechanisms for test calibration.
Hsieh et al. introduced a model that presents test statistics and collects students' learning behaviors in order to generate analysis results and feedback for tutors. Hung et al. proposed an analysis model based on CTT that collects information such as item difficulty and discrimination indices, questionnaire and question style, etc. These data are combined with a set of rules in order to detect defective items, which are signaled using traffic lights. Costagliola et al.'s eWorkbook system improved on that idea by using fuzzy rules to measure item quality, detect anomalies in the items, and give advice for their improvement. Nevertheless, all of the aforementioned works preferred CTT to IRT for ease of use without taking into consideration its numerous deficiencies.
On the other hand, IRT has been mainly applied in the Computerized Adaptive Test (CAT) domain for personalized test construction based on individual ability. Despite its high degree of support among theoreticians and some practitioners, IRT's complexity and dependence on unidimensional test data and large samples often relegate its application to experimental purposes only. While a literature review can reveal many different IRT estimation algorithms, they all involve heavy mathematics and are unsuitable for implementation in a scripting language designed for web development such as PHP. As a result, their integration in Internet applications such as LMSs is very limited. One way to address this issue is to have a webpage call the open-source analysis tool ICL to carry out the estimation process and then import its results for display. The present paper showcases a framework that follows this method in order to extend an LMS with IRT analysis services at no extra programming cost.
Several computer programs that provide estimates of IRT parameters are currently available for a variety of computer environments. These include Rascal, Ascal, WINSTEPS, BILOG-MG, MULTILOG, PARSCALE, RUMM and WINMIRA, to name a few that are easily obtainable. Despite being the de facto standard for dichotomous IRT model estimation, BILOG is a commercial product and limited in other ways. Hanson provided an alternative stand-alone program for estimating the parameters of IRT models called IRT Command Language (ICL). A recent comparison between BILOG-MG and ICL showed that both programs are equally precise and reliable in their estimations. However, ICL is free, open-source software licensed in a way that allows it to be modified and extended. In fact, ICL is essentially a set of IRT estimation functions (ETIRM) embedded into a fully featured programming language called TCL ("tickle"), which allows relatively complex operations. Additionally, ICL's command-line nature enables it to run in the background and produce analysis results in the form of text files. Since the proposed framework uses only a three-parameter binary-scoring IRT model, ICL proves more than sufficient for our purpose and was therefore selected to complement the LMS for assessment calibration.
Dokeos is an open-source LMS distributed under the Free Software Foundation's General Public License. It is implemented in PHP and requires Apache as a web server and MySQL as a Database Management System. Dokeos has been serving the needs of two academic courses at the University of Macedonia for over four years, receiving satisfactory feedback from both instructors and students. In order to extend its functionality with IRT analysis and assessment calibration functions, we had to modify the source code so as to support the following features:
- After completing a test session, the LMS stores in its database the examinee's response to each test item instead of keeping only a final score by default.
- Test developers define the acceptable limits for the following IRT analysis parameters: a) item discrimination, b) item difficulty, and c) guessing. The LMS stores these values as validity rules for each assessment. There is an additional choice of having these limits set automatically by the system in order to rule out a specific percentage of questions (Fig. 1.1).
- Every time the LMS is asked to perform an IRT analysis, it displays a page with the estimated difficulty, discrimination and guessing parameters for each assessment item. If an item violates any of the validity rules already defined in the assessment profile, it is flagged for review of its content (Fig. 1.2). Once item responses are evaluated, test developers can discard, revise or retain items for future use.
- In addition to a total score, the assessment report screen displays the proficiency θ per examinee as derived from the IRT analysis (Fig. 1.3).
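The flagging step described in the features above can be sketched as a simple rule check. The class, the function, the limits and the item estimates below are all hypothetical illustrations; they are not taken from the Dokeos source code or from any assessment profile.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ValidityRules:
    """Acceptable limits for the a, b and c parameters of one assessment."""
    min_a: float
    min_b: float
    max_b: float
    max_c: float


def review_reasons(a: float, b: float, c: float, rules: ValidityRules) -> List[str]:
    """Return the reasons an item violates the rules (empty if it passes)."""
    reasons = []
    if a < rules.min_a:
        reasons.append("low discrimination")
    if b < rules.min_b:
        reasons.append("too easy")
    if b > rules.max_b:
        reasons.append("too difficult")
    if c > rules.max_c:
        reasons.append("high guessing")
    return reasons


# Illustrative limits and (a, b, c) estimates keyed by item number.
rules = ValidityRules(min_a=0.6, min_b=-2.0, max_b=2.0, max_c=0.3)
items: Dict[int, Tuple[float, float, float]] = {
    1: (1.2, 0.1, 0.15),
    2: (0.4, 0.5, 0.10),
    3: (0.9, 2.4, 0.35),
}

flagged = {}
for number, (a, b, c) in items.items():
    reasons = review_reasons(a, b, c, rules)
    if reasons:  # keep only items that need review
        flagged[number] = reasons
```

Returning the list of reasons rather than a single flag lets the interface tell the test developer why an item needs attention.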
Once an update of the IRT results is called for, the LMS performs a number of calls to the ICL using PHP. This procedure comprises four stages:
- Dokeos exports the assessment results to a data file and generates a TCL script to process them (parameter estimation script).
- It then calls up ICL with the parameter estimation script passed as a parameter in order to create a data file containing the a, b, and c values for each test item. At the same time it prepares a second TCL script to process these IRT parameters (θ estimation script).
- Dokeos calls up ICL with the θ estimation script passed as a parameter so as to make a data file with the examinees' θ values.
- Finally, Dokeos imports the two ICL-produced data files (*.par and *.theta) to its database, thus extending its functionality with assessment calibration.
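The four stages above can be sketched as follows. The actual system performs these calls from PHP; Python is used here only for illustration. The executable name `icl`, the file names, the one-character-per-item data layout and the whitespace-separated output format are assumptions, and the contents of the two TCL scripts are left to the caller rather than reproduced.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence, Tuple


def call_icl(script_path: Path) -> None:
    """Run ICL with a script as its argument (executable name is assumed)."""
    subprocess.run(["icl", str(script_path)], check=True)


def read_numeric_rows(path: Path) -> List[List[float]]:
    """Parse a whitespace-separated text file into rows of floats."""
    rows = []
    for line in path.read_text().splitlines():
        if line.strip():
            rows.append([float(value) for value in line.split()])
    return rows


def run_calibration(
    workdir: str,
    responses: Sequence[Sequence[int]],
    param_script: str,
    theta_script: str,
    run_icl: Callable[[Path], None] = call_icl,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Export responses, run both estimation scripts, import the results."""
    base = Path(workdir)

    # Stage 1: export the responses and write the parameter estimation script.
    lines = ["".join(str(x) for x in row) for row in responses]
    (base / "test.dat").write_text("\n".join(lines) + "\n")
    param_path = base / "params.tcl"
    param_path.write_text(param_script)

    # Stage 2: estimate a, b and c, then write the theta estimation script.
    run_icl(param_path)
    theta_path = base / "theta.tcl"
    theta_path.write_text(theta_script)

    # Stage 3: estimate theta for each examinee.
    run_icl(theta_path)

    # Stage 4: import both result files (assumed to be written by the scripts).
    item_params = read_numeric_rows(base / "test.par")
    thetas = read_numeric_rows(base / "test.theta")
    return item_params, thetas
```

Passing the ICL call as a parameter keeps the sketch testable without ICL installed: a stub that writes sample result files can stand in for `call_icl`.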
The proposed system has been implemented by adding the features described above to an existing version of Dokeos at the Dept. of Applied Informatics, University of Macedonia. A pilot assessment test containing a set of 40 questions on "Fundamentals of Information Systems" was arranged for the experiment. Since it was not connected to an actual university course and contained questions of a general nature, it managed to attract the attention of 113 students who voluntarily participated in the experiment. Before administering the test, the acceptable limits for the IRT parameters were set to a ≥ 0.5, -1.7 ≤ b ≤ 1.7, and c ≤ 2.5 respectively. The IRT analysis following the completion of the assessment test revealed 9 test items that needed reviewing. In particular, items 6, 10, 12 and 33 showed a low degree of discrimination (Fig. 3), items 21 and 27 appeared too difficult and item 38 was deemed too easy (Fig. 4). Two further items (24, 37) were flagged for revision due to their high guessing value (Fig. 5). Once an initial item pool has been calibrated, examinees can be tested routinely. As time goes on, it will almost surely become desirable to retire items that are flawed, have become obsolete, or have been used many times, and to replace them with new items. With these problematic items already detected by the LMS, test developers can take any necessary course of action to improve their quality. Additionally, since the limits for the IRT analysis parameters are not hard-coded, test developers can modify them at will in order to tune the sensitivity of the system.
This paper introduced an architectural framework for extending an LMS with IRT-based assessment calibration. A case study was conducted at the University of Macedonia in order to examine the functionality and efficiency of the proposed system. The enhanced LMS proved to be a valuable tool in assisting test developers to refine flawed test items. Moreover, the user-friendly interface allowed users with no previous expertise in statistics to comprehend and utilize the IRT analysis results as provided by ICL.
Nevertheless, more extensive experiments involving a larger set of items and examinees are necessary to better measure the system's capabilities. To address this issue we intend to deploy the proposed system in an actual academic course. Our main focus is to investigate whether it can improve assessment quality when the examinee samples are relatively small, as is usually the case. Future plans also include a further extension of Dokeos by integrating more of ICL's functions, such as multiple group estimations or estimations for polytomous items. Our ultimate goal is to turn the enhanced LMS into a web interface for ICL capable of performing IRT analysis on data imported from a web client.