P-, D- and Rit-values: a new start

When you have built and implemented a test and you think you are done…then think again!

You are not done yet.

Computer based tests that consist of items that can be auto-marked generate heaps of management information.

The testing tool records the input of all candidates and this data can be turned into valuable information. P-, D- and Rit-values are examples of useful psychometric data. Through analysing the statistics and making appropriate changes the test can be improved.

Manual Task

Despite the fact that the testing system provides all of the statistical data, improving test items is a manual task.

It is important that you understand the data, so it can be interpreted correctly. The process of reviewing the performance of items is what we call psychometric analysis.

So, let’s imagine you are running a testing program and you have extracted the P-, D- and Rit-values.

Then what?

Let’s have a look at what these values represent.

The P-value.

This value represents the share of candidates who answered the question correctly. In practical terms: if an item has a high P-value, chances are that the question was fairly easy; most candidates who were presented with this item gave the correct response. Items that are either too easy or too difficult add little value to a test. In fact, easy questions are likely to throw off good test takers, because such a candidate assumes that the question cannot be that simple to answer. The optimal P-value for an item lies between 0.3 and 0.8.
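To make this concrete, here is a minimal sketch in plain Python, using made-up response data, of how a P-value is calculated:

```python
# Minimal sketch: compute the P-value of one item from scored responses.
# `responses` is made-up data: one 0/1 score per candidate (1 = correct).
responses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

p_value = sum(responses) / len(responses)
print(f"P-value: {p_value:.2f}")  # 0.70 -> inside the 0.3-0.8 range

# Flag items that fall outside the recommended range.
if not 0.3 <= p_value <= 0.8:
    print("Review this item: it may be too easy or too hard.")
```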

The D-value.

This value represents the distribution of the candidates’ choices across the available distractors. The question that we want to see answered is: were all of the answer options (A, B, C etc.) used by the candidates? The purpose of distractors is to offer plausible answers to test takers who have not mastered the subject sufficiently. Performing a distractor analysis will often demonstrate that multiple choice questions really only need three answer alternatives: if you use more than two distractors (note: two distractors plus the correct answer equals three answer alternatives), you will find that some distractors are not selected at all. You may then decide to remove these distractors altogether.
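At its core, a distractor analysis is a matter of counting how often each answer option was selected. A minimal sketch, again with invented response data:

```python
from collections import Counter

# Made-up raw responses: the option each candidate selected for one item.
selected = ["A", "B", "A", "A", "C", "A", "B", "A", "A", "A", "B", "A"]
correct_option = "A"  # assumption for this example

counts = Counter(selected)
total = len(selected)

for option in "ABCD":
    share = counts.get(option, 0) / total
    marker = " (correct)" if option == correct_option else ""
    print(f"{option}: {counts.get(option, 0):3d} ({share:.0%}){marker}")

# Distractors that are (almost) never chosen are candidates for removal.
unused = [o for o in "ABCD" if o != correct_option and counts.get(o, 0) == 0]
print("Unused distractors:", unused or "none")
```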

The Rit-value.

This value reflects the performance of the item relative to the test as a whole. It tells us to what extent an item contributes to separating the good candidates from the entire pool of test takers. In short: the Rit-value demonstrates the discriminating power of an item.
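The Rit-value is typically calculated as the correlation between candidates’ scores on the item and their total test scores. A minimal sketch (plain Python 3.10+, made-up data):

```python
import statistics

def rit_value(item_scores, total_scores):
    """Pearson correlation between an item's 0/1 scores and the total test
    scores: the usual way an item-total (Rit) correlation is computed."""
    return statistics.correlation(item_scores, total_scores)

# Made-up data: scores on one item and total scores for eight candidates.
item   = [1, 1, 0, 1, 0, 1, 0, 1]
totals = [34, 30, 18, 28, 22, 31, 15, 27]

print(f"Rit: {rit_value(item, totals):.2f}")  # higher = better discrimination
```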

Combining the P-value and the Rit-value

When combining the P-value and the Rit-value in a scatter chart we create a good view of the items that need to be evaluated.

[Scatter chart: P-value versus Rit-value for each item]

The items in the orange boxes need to be reviewed.
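If your testing tool does not produce such a chart for you, it is easy to sketch one yourself. The example below uses made-up (P, Rit) pairs and shades the review zones; the P-value boundaries of 0.3 and 0.8 come from the discussion above, while the Rit cut-off of 0.2 is purely illustrative:

```python
import matplotlib.pyplot as plt

# Made-up (P-value, Rit-value) pairs, one per item.
p_values   = [0.92, 0.85, 0.78, 0.64, 0.55, 0.41, 0.33, 0.95, 0.25]
rit_values = [0.05, 0.12, 0.31, 0.42, 0.38, 0.28, 0.35, 0.02, 0.10]

fig, ax = plt.subplots()
ax.scatter(p_values, rit_values)
ax.axvspan(0.8, 1.0, alpha=0.2, color="orange")  # too easy
ax.axvspan(0.0, 0.3, alpha=0.2, color="orange")  # too hard
ax.axhspan(0.0, 0.2, alpha=0.1, color="orange")  # weak discrimination
ax.set_xlabel("P-value")
ax.set_ylabel("Rit-value")
ax.set_title("P-value versus Rit-value (shaded areas: items to review)")
plt.show()
```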

It is striking to see that this test contains a high number of easy items (high P-values). The accepted norm for Rit-values is debatable. In general, scientific research papers make the following recommendation:

[Image: recommended Rit-value ranges]

The devil is in the detail.

And clearly there is much to be gained from a thorough analysis of the performance of items.

As such, the delivery of your test is not the final step in the process.

It is merely the start of improving the validity of your testing program.

 

So, is a randomised test fair?

It depends.

A randomised test can be totally fair but it can also be biased.

A test is biased when the results have consequences that unfairly advantage or disadvantage test takers.

Is it possible to determine whether a test is fair? Whether it is equally difficult for all candidates?

Yes it is. But only in hindsight.

An analysis of the average P-value per generated test is of great help in establishing the fairness of the test. When these average P-values are spread across a broad range, it is highly likely that the generated tests had varying levels of difficulty.

[Chart: average P-values of the generated tests]
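A quick way to perform such an analysis is to compute the average P-value per generated test and look at the spread. A minimal sketch with made-up numbers (the 0.10 threshold is an illustrative choice, not an official norm):

```python
from statistics import mean

# Made-up data: for each generated test form, the P-values of its items.
forms = {
    "form_1": [0.45, 0.62, 0.71, 0.38, 0.55],
    "form_2": [0.81, 0.77, 0.69, 0.74, 0.80],
    "form_3": [0.52, 0.60, 0.58, 0.49, 0.66],
}

averages = {name: mean(p_values) for name, p_values in forms.items()}
for name, avg in averages.items():
    print(f"{name}: average P-value {avg:.2f}")

spread = max(averages.values()) - min(averages.values())
print(f"Spread: {spread:.2f}")
if spread > 0.10:  # illustrative threshold
    print("The generated tests probably differ in difficulty.")
```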

But…in hindsight is too late!

You want certainty about the fairness of your test before it is delivered to your candidates.

It will prevent you from having to deal with some considerable headaches afterwards :).

So, how is this done?

How does one create a test that is fair to each and every candidate?

The key condition is that you have a clear understanding of the level of knowledge and/or skills of your candidates.

Your candidates really do not have to know the answers to all of the questions. In fact, some things are learnt by doing and through gaining experience.

So what does this have to do with randomised tests?

Everything!

Because: When designing a randomised test you want to ensure that candidates who come well prepared are presented with questions that they can answer.

You want to be able to distinguish the competent candidates from those who require further education. Naturally, you work your way back from the norm.

So what is the goal? What do you want to measure?

You want to be able to assess a candidate’s knowledge and insight at a predetermined level of the subject matter. You want to set a standard. And for this to work correctly, you need to be a subject matter expert. During an item review you apply your knowledge of the subject matter and sort all of the items into buckets of easy, moderate and hard questions. This method of standard setting is the foundation of a good randomised test. The approach is known as the Angoff method.

Depending on the testing solution that you use, you ensure that all levels of difficulty are reflected in your test specification matrix – or blueprint. QuestionMarkPerception has this to say on the subject.
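Conceptually, generating a fair randomised test then comes down to drawing the same number of items from each difficulty bucket for every candidate. A minimal sketch, with a made-up item bank and a hypothetical blueprint:

```python
import random

# Made-up item bank: item IDs grouped into difficulty buckets after review.
item_bank = {
    "easy":     ["E01", "E02", "E03", "E04", "E05"],
    "moderate": ["M01", "M02", "M03", "M04", "M05", "M06"],
    "hard":     ["H01", "H02", "H03", "H04"],
}

# Hypothetical blueprint: how many items of each difficulty every candidate gets.
blueprint = {"easy": 3, "moderate": 4, "hard": 2}

def generate_form(bank, blueprint, seed=None):
    """Draw a random test form that respects the blueprint, so every
    candidate receives the same mix of difficulties."""
    rng = random.Random(seed)
    form = []
    for bucket, count in blueprint.items():
        form.extend(rng.sample(bank[bucket], count))
    rng.shuffle(form)
    return form

print(generate_form(item_bank, blueprint, seed=42))
```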

Seeding items

Andriessen’s Sisto offers the possibility of seeding pretest items. These items do not count towards a candidate’s final score, but you can use them to determine their P- and Rit-values. Is it a hard, moderate or easy question?

When you have collected sufficient information about the pretest item you then decide whether it can be included in the test. This allows you to remove an item that is not performing well or you amend it and include it as a new pretest item in the next release of your test.
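Conceptually the scoring looks something like the sketch below. This is a generic illustration with made-up data, not Sisto’s actual implementation: pretest items are skipped when calculating the final score, but their responses are logged for later analysis.

```python
# Made-up responses of one candidate: (item_id, is_pretest, correct)
responses = [
    ("I001", False, True),
    ("I002", False, False),
    ("P101", True,  True),   # seeded pretest item
    ("I003", False, True),
    ("P102", True,  False),  # seeded pretest item
]

# The final score only counts the live items...
live = [r for r in responses if not r[1]]
score = sum(1 for _, _, correct in live if correct) / len(live)
print(f"Score on live items: {score:.0%}")

# ...while the pretest responses are stored so P- and Rit-values
# can be calculated once enough data has been collected.
pretest_log = [(item_id, correct) for item_id, is_pretest, correct in responses if is_pretest]
print("Logged pretest responses:", pretest_log)
```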

Taking this approach to the design and further development of your test allows you to improve its quality.

Your item bank will increasingly consist of items of a similar difficulty level.

Should you wish to use items with significantly different levels of difficulty then you will want to label your items or use a test matrix that is designed to fairly distribute these items.

Randomised tests: The advantages:

  1. They decrease the value of exam or item theft. Every test is different!
  2. They make it easy to swap pretest items in and out.
  3. They allow you to gradually grow your item bank, increasing the randomisation of items.
  4. Every candidate is presented with a unique test.

So, is a randomised test fair?

Yes.

But it requires work and maintenance. Particularly in the area of item difficulty.

Consider this: video clips in tests

One of the advantages of computer based testing is the ability to use multimedia files.

From a technical perspective the inclusion of video clips is not a problem. Almost all vendors of computer based testing solutions will offer this in some shape or form. There is a variety of low-cost tools that will allow you to place your videos online and embed them in your test.

[Image: a video embedded in a test question]

YouTube offers a simple, reliable and cost-efficient way of embedding video in your test.

By default YouTube videos are public, which means that anyone can watch them.

Keep this in mind when you include video in a test.

The use of video enhances the candidate experience and it can add flavour to formative tests. When it comes to summative tests you want to be more careful: it is important to establish that candidates who have already seen the video before the test have no advantage over those who haven’t.

YouTube

YouTube offers the option of making a video unlisted.

This means that only those people that have the link to the video can view it. Unlisted videos don’t show up in YouTube’s search results unless someone adds your unlisted video to a public playlist.
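If your testing solution accepts raw HTML in an item (this differs per vendor), embedding comes down to the standard YouTube iframe snippet. A small Python sketch that builds it from a placeholder video ID (replace the ID with the one from your own unlisted link):

```python
# Build the standard YouTube embed markup for an (unlisted) video.
# "YOUR_VIDEO_ID" is a placeholder, not a real video ID.
video_id = "YOUR_VIDEO_ID"

embed_html = (
    f'<iframe width="560" height="315" '
    f'src="https://www.youtube.com/embed/{video_id}" '
    f'frameborder="0" allowfullscreen></iframe>'
)
print(embed_html)  # paste this into an item that accepts HTML
```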

A good alternative to YouTube is Vimeo. Vimeo Plus or Pro subscriptions are very affordable (approximately $60 or $200 per year respectively) and offer features such as video password protection, domain-level privacy and advanced viewing statistics. Furthermore, you can add your own logo to the video player – a nice touch!

Consider this when using (online) video:

  1. What is the impact if a candidate has already watched the video before the start of the test?
  2. Do you have the rights to use the video in your test?
  3. Is the bandwidth sufficient for all candidates to view the video simultaneously?
  4. Can YouTube, Vimeo or another video player be accessed from the test station?

Sources: https://support.google.com/youtube/answer/157177?hl=en and https://vimeo.com/upgrade

Example of embedded video in the English Example.

The quest for distractors: Getting the wrong answers right

I need distractors. And not because I am bored. No, I am looking for information on distractors in multiple choice questions.

The word distractor is used for the alternatives to the correct answer. So we are talking about the baddies, the wrong ones. But how do I get those right?

The literature is certain about one thing in relation to distractors. And anyone who has ever developed a test with multiple choice questions will agree:

It is difficult to create good distractors.

Naturally, writing a good question (or item) is an art. But we get them right most of the time. However, creating good distractors – nice alternatives – is a real challenge.

Here’s some help.

An important condition for the answer alternatives is: all distractors must be plausible. That is, every distractor should be a potential answer to the question.

But when you are not an expert in the subject matter, you will most likely see all of the response options as equally logical or probable.

The purpose of creating good distractors is to distinguish the good test takers from the bad. No doubt you want a good candidate to answer all of the questions correctly, while the candidate who has not studied hard enough is thrown off balance by the answer alternatives. He will start guessing what the correct answer is and, eventually, he will (hopefully) fail the test. Candidates guess when they are unsure, which means they will try to reason which answer best fits the question.

Recommendations for creating great distractors:

  1. All distractors must be equally likely, and grammatically and factually correct.
  2. All distractors must be of similar length; keep them brief. Provide relevant information in the question rather than in the answer alternatives.
  3. Make use of counterexamples when creating distractors, and do not use (double) negatives.
  4. All distractors must be written in the same style. If possible, avoid jargon and watch out for vague descriptions.
  5. Use a limited number of distractors. Three answer alternatives are as good as four. In practice, one of four answer alternatives is rarely selected.

And on that last point: It is a best practice to analyse the performance of your items. Verify whether all of your answer alternatives were used by candidates. Have a look at this pie chart.

[Pie chart: number of candidates selecting each answer alternative]

You can see that alternative D is never selected. Alternative C is selected by two of the 201 candidates.

Conclusion: there is work to be done.

In any case, alternative D can be deleted.

Last but not least: Four-eyes principle

Always remember that the best way to create good questions and answers is the four-eyes principle.

When you have created a question and its answer alternatives, have the question checked by a colleague. Two pairs of eyes see more than one.

Good luck with making good distractors.

Want to read more?

Writing good multiple choice test questions

Multiple choice is king

Computer based testing tools all offer the well-known and highly maligned multiple choice question, also known as the MCQ item type.

But did you know that testing (or examination) tools offer many other different item types? And that most of these are based on closed questions?

Candidates’ responses to closed questions can be automatically marked.

In my view this is a great example of the benefits that testing software offers versus the classic paper and pencil test.

Providing the candidate with the result of his or her test does not require manual intervention. The result can be automatically sent to the candidate or the institution that sponsors the test.

Types of closed questions

Examples of closed questions include:

  • Multiple Choice Single Response: One of the answers is correct
  • Multiple Choice Multiple Response: More than one answer is correct
  • Drag and drop (matching): Drag and drop an object in an image or piece of text
  • Ranking: Put lines of text or images in the correct order
  • Fill in the blank: Enter the correct word or combination of words into a text box
  • Hotspot: Place a marker on the correct spot in a picture, video or image
  • Numerical: Enter the answer to a numerical or mathematical question

All of these questions can be marked automatically.
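To illustrate what automatic marking amounts to, here is a minimal sketch for two of these item types. The all-or-nothing scoring of the multiple response item is just one possible choice; many tools also support partial credit.

```python
# Two of the closed item types from the list above, marked automatically.

def mark_single_response(selected, correct):
    """Multiple Choice Single Response: a point only for the correct option."""
    return 1 if selected == correct else 0

def mark_multiple_response(selected, correct):
    """Multiple Choice Multiple Response: all-or-nothing scoring."""
    return 1 if set(selected) == set(correct) else 0

print(mark_single_response("B", "B"))                  # 1
print(mark_multiple_response({"A", "C"}, {"A", "C"}))  # 1
print(mark_multiple_response({"A"}, {"A", "C"}))       # 0
```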

But are all these item types used?

Well no, not really.

Have a look at this data that we pulled from Sisto:

 

Item type                            Volume        Share
Hotspot (single marker)              3,988         0.0%
Hotspot (multiple markers)           4,248         0.0%
Fill in the blank (single)           37,043        0.1%
Fill in the blanks (multiple)        56,546        0.1%
Multiple Choice Single Response      38,692,327    90.2%
Multiple Choice Multiple Response    1,273,913     3.0%
Numerical                            898,283       2.1%
Essay (computer based)               1,428,539     3.3%
Ranking                              81,307        0.2%
Essay (on paper)                     185,965       0.4%
Matching                             10,930        0.0%
Speaking                             231,100       0.5%
Upload                               166           0.0%
Total                                42,904,355    100.0%

Clearly the multiple choice item type is the most popular. By a long way.

So why do the technical specifications of tenders or requests for proposal often put such a strong focus on the types of items that a vendor is able to support?

It really is not as important as we are led to believe.

Take it from me: The multiple choice question is king!