In this article, a more detailed explanation of infit measures is provided, both for judges and for candidates.

Let us first examine in detail how the comparative judgement process works.

In each judgement, two scripts are presented and one is chosen - that decision is recorded (1 for chosen, 0 for not chosen) as the data for that judgement along with which two candidates were judged. A judge carries out a series of these judgements and all the data is collected for that judge. Of course, there may be a number of judges so the total data for the task is the data from all the judges.

This total data is fed into a mathematical model to arrive at the true scores for each candidate - this takes into account not just how many times a candidate’s script was chosen but also who they were compared against each time (and how highly these eventually scored).

Based on these resulting true scores for candidates, we can then look back and calculate the probability of one script being chosen over another (by considering their true scores). We can then compare this probability with the actual decision made for these two candidates. This gives a ‘residual’ value for that decision:

residual for decision = actual decision (1 or 0) - probability for that decision (between 0 and 1)

Actually, to remove negative numbers we look at the residual squared. This residual squared tells us whether the actual decision agrees with what we would expect, taking into account all the decisions. A low residual squared (closer to 0) indicates a decision in line with the other decisions; a high residual squared (closer to 1) indicates a decision that goes against the others.
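As a concrete sketch of the two steps above: the snippet below computes the probability of one script being chosen over another from the two true scores, then the residual squared for the recorded decision. The logistic (Bradley-Terry style) form of the probability is an assumption for illustration here; the article does not specify the exact model used.

```python
import math

def choice_probability(score_a, score_b):
    """Probability that script A is chosen over script B, assuming a
    Bradley-Terry style model (an illustrative assumption: the higher
    the score gap, the closer the probability gets to 1)."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

def residual_squared(decision, probability):
    """decision is 1 if script A was chosen, 0 if not."""
    return (decision - probability) ** 2

p = choice_probability(1.2, 0.4)      # A has the higher true score, so p > 0.5
print(residual_squared(1, p))         # decision agrees with expectation: low value
print(residual_squared(0, p))         # decision goes against expectation: high value
```

Note the symmetry: judging "B over A" gives probability 1 - p and decision 1 - d, so the residual squared comes out the same whichever way round the pair is recorded.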

Judge infit

So for each decision made, we can obtain a residual squared value. As a judge makes a series of judgements, we can effectively* average all their residual squared values to work out an overall infit for that judge. The higher the infit, the more out of step the judge is with the other judges and their decisions. With our national assessments, if a judge's infit is more than 1.3 on the local task, we exclude that judge from the moderation task to make sure there is consistency in the moderation.

(*We say effectively because it is not quite a straightforward average. There is a weighting given to each residual squared value which depends on how close in quality the compared scripts are according to the final true scores. This is why we can get values of over 1 for infit).
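The weighted average in the footnote can be sketched as follows. This assumes the standard information-weighted mean-square form of infit from Rasch analysis, where each judgement is weighted by its variance p(1 - p) under the model (largest when the two scripts are close in quality, i.e. p is near 0.5); the article does not state the exact formula, so treat this as an illustrative assumption.

```python
def infit(decisions, probabilities):
    """Information-weighted mean-square fit for one judge.

    decisions:      the actual choices (1 or 0), one per judgement
    probabilities:  the model probability of a '1' for each judgement
    """
    # residual squared for each judgement
    r2 = [(d - p) ** 2 for d, p in zip(decisions, probabilities)]
    # weight for each judgement: its variance p(1 - p) under the model,
    # largest for close calls (p near 0.5)
    w = [p * (1 - p) for p in probabilities]
    return sum(r2) / sum(w)

# A judge whose decisions match the model's expectations:
# infit comfortably below the 1.3 threshold
print(infit([1, 1, 0, 1], [0.8, 0.7, 0.3, 0.6]))

# A judge going against clear-cut expectations: infit well above 1.3
print(infit([0, 0, 1, 1], [0.8, 0.7, 0.3, 0.6]))
```

Because surprising decisions on lopsided pairs produce residuals much larger than their small weights, this ratio can exceed 1, which is why infit is not capped at 1 the way a plain average of residual squared values would be.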

Candidate infit

Similarly, we can average over the decisions involving a candidate. A candidate with a high infit suggests a script about which there has been general disagreement amongst the judges. A rule of thumb here is to look at scripts with a candidate infit over 2.0.
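A candidate's infit can be sketched with the same weighted mean-square idea, pooling every judgement that candidate's script appeared in. The function name, the tuple layout of a judgement, and the grouping scheme below are all illustrative assumptions, not the article's actual implementation.

```python
from collections import defaultdict

def candidate_infits(judgements):
    """judgements: list of (candidate_a, candidate_b, decision, probability),
    where decision is 1 if candidate_a's script was chosen and probability
    is the model probability of candidate_a being chosen."""
    r2_sum = defaultdict(float)
    w_sum = defaultdict(float)
    for a, b, d, p in judgements:
        # each judgement contributes to both candidates involved;
        # the residual squared and the weight are symmetric in a and b
        for cand in (a, b):
            r2_sum[cand] += (d - p) ** 2
            w_sum[cand] += p * (1 - p)
    return {cand: r2_sum[cand] / w_sum[cand] for cand in r2_sum}

judgements = [
    ("A", "B", 1, 0.7),
    ("A", "C", 0, 0.9),   # a surprising loss for A against C
    ("B", "C", 1, 0.6),
]
infits = candidate_infits(judgements)
flagged = [cand for cand, value in infits.items() if value > 2.0]
```

With these illustrative numbers, the surprising A-versus-C decision pushes both A and C over the 2.0 rule of thumb, while B, whose judgements went as expected, stays well below it.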

What impacts on infit

So if a judge is generally disagreeing with the other judges, that will likely lead to a high infit. This may be due to making the judgements too quickly (so it is worth looking at the median judging time for that judge) or simply having a very different perspective on what makes a good script. It doesn't necessarily mean that the judge is 'wrong', though - it simply means they have a different view to the other judges. A few mistaken judgements may also increase the infit, so it is always worth looking at the information on the judgements made by a judge - more information on how to do that is provided here. Related to this, if a judge only makes a few judgements and one of them is less accurate, they may end up with an unusually high infit. So low numbers of judgements can have an impact too.

What if my task has a high infit judge?