The Challenge of Designing a Valid Test

Validity, to give a short and non-technical definition, is “the extent to which a test measures exactly what the test maker intends it to measure, nothing more, nothing less” (definition courtesy of Donald E. Powers, ETS).

Why is understanding a test's validity important? Simply put, if you don't know how valid a test is, you don't know to what extent its results are meaningful and actually provide the information you need.

How can you determine how valid a test is? As a test user, you can simply ask the testing company for test development background and validity statistics. For the test's creator, however, ensuring validity is a huge challenge.

Proven, superior validity is the primary reason to consider TOEIC tests, more important even than their worldwide recognition, the high level of support provided for administration, and the ease of interpreting scores across varied work levels and environments. That is why ETS makes validity its top priority.

Step one in building a valid test is defining exactly what the test should measure. TOEIC’s goal is to measure a non-native English speaker’s ability to communicate in English in the global workplace, using key expressions and common, everyday vocabulary.

Test developers and language/assessment experts break this goal down as precisely as possible: what skills are most important for communicating effectively in this context, and how important is each?

TOEIC is researched and designed so that the most important components of communicating in English (listening for purpose, use of vocabulary, task completion, and so on) are measured, and so that each is included in the test in proportion to its real-life importance for effective English communication in the global workplace, as sketched below.
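To make the idea of proportional coverage concrete, here is a minimal sketch in Python. The component names, weights, and tolerance are invented for illustration; they are not the actual TOEIC specification.

```python
# Hypothetical blueprint: each skill component and the share of the test
# it should occupy. These names and weights are illustrative only.
BLUEPRINT = {
    "listening_for_purpose": 0.30,
    "vocabulary_in_context": 0.25,
    "task_completion":       0.25,
    "grammar_in_use":        0.20,
}

def check_form_against_blueprint(form_items, tolerance=0.05):
    """Check that a form's items cover each component in roughly the
    proportion the blueprint prescribes.

    form_items: one component label per test item on the form.
    """
    total = len(form_items)
    for component, target in BLUEPRINT.items():
        actual = form_items.count(component) / total
        status = "OK" if abs(actual - target) <= tolerance else "OUT OF TOLERANCE"
        print(f"{component}: {actual:.2f} (target {target:.2f}) {status}")

# A made-up 20-item form that happens to match the blueprint exactly.
form = (["listening_for_purpose"] * 6 + ["vocabulary_in_context"] * 5
        + ["task_completion"] * 5 + ["grammar_in_use"] * 4)
check_form_against_blueprint(form)
```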

A second important part of achieving validity during test design is eliminating, to the greatest extent possible, “construct-irrelevant components”: elements outside what the test was designed to measure that can nonetheless affect test takers’ results. For example, overly complex test questions may make test takers anxious and unable to show what they can normally do. A test that takes too long may measure test takers’ ability to answer questions quickly as much as their English proficiency. And questions may yield significantly different results when used with similarly capable individuals of different genders or from different cultures.
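That last effect, an item behaving differently across subgroups, is something that can be checked empirically. Below is a toy illustration with invented numbers; operational analyses of differential item functioning use far more rigorous methods (Mantel-Haenszel statistics, for example).

```python
# Toy check for an item that behaves differently for two subgroups of
# comparable overall ability. All numbers are invented; real differential
# item functioning (DIF) analyses use methods such as Mantel-Haenszel.

# 1 = answered this item correctly, 0 = answered it incorrectly
group_a = [1, 1, 1, 0, 1, 1, 0, 1]  # subgroup A, matched on ability with B
group_b = [0, 1, 0, 0, 1, 0, 0, 1]  # subgroup B

p_a = sum(group_a) / len(group_a)
p_b = sum(group_b) / len(group_b)

print(f"proportion correct, group A: {p_a:.2f}")  # 0.75
print(f"proportion correct, group B: {p_b:.2f}")  # 0.38

# A gap this large between ability-matched groups would flag the item
# for review (and likely revision or removal) before it reached a live form.
```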

A few of the methods used to achieve validity during TOEIC test design:

  • Detailed test “blueprints” created by world-class test developers guide the design of every test, so that each form is highly similar to every other and covers the right content in the same proportions.
  • World-class psychometricians design TOEIC “test items” (questions and answer choices).
  • Each individual test item goes through numerous reviews and is inspected by some 20 test reviewers before pilot testing. Some items never make it through this process, while others undergo extensive revision.
  • Pilot test results undergo statistical analysis at the item level to examine such elements as question difficulty, how well an item distinguishes high-ability from low-ability test takers, how the incorrect options behave, and whether an item behaves consistently for test takers of the same ability from different subgroups (see the sketch after this list).
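To give a flavor of what that item-level analysis involves, here is an illustrative sketch of two classical item statistics, difficulty and discrimination, computed from a made-up matrix of pilot responses. This is not ETS’s actual pipeline; operational analyses are considerably more sophisticated.

```python
# Illustrative classical item statistics from pilot-test data.
# All responses are invented; this is not ETS's actual analysis.
import numpy as np

# responses[i, j] = 1 if test taker i answered item j correctly, else 0
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
])

total_scores = responses.sum(axis=1)

for j in range(responses.shape[1]):
    item = responses[:, j]
    # Difficulty: the proportion answering correctly (the classical "p-value").
    p = item.mean()
    # Discrimination: correlation between the item and the total score on the
    # remaining items; high scorers should tend to answer the item correctly.
    rest = total_scores - item
    r = np.corrcoef(item, rest)[0, 1]
    print(f"item {j}: difficulty p = {p:.2f}, discrimination r = {r:.2f}")
```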

This brief discussion gives only a taste of what’s involved in creating a quality test, and the challenge for TOEIC extends to every step of the testing process and across each test form’s lifecycle, from test administration and scoring to security and limiting form exposure.

The results? Top-of-class performance in validity, internal test consistency, fairness, and cross-form reliability.
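For readers curious what “internal test consistency” looks like as a number: it is commonly summarized with Cronbach’s alpha. Here is a minimal, self-contained sketch using invented 0/1 response data; nothing below reflects actual TOEIC results.

```python
# Cronbach's alpha, a common summary of internal consistency.
# The response data here are invented for illustration.
import numpy as np

def cronbach_alpha(responses):
    """responses: 2-D array, rows = test takers, columns = items (0/1)."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                         # number of items
    item_vars = responses.var(axis=0, ddof=1)      # per-item variance
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

data = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(f"Cronbach's alpha: {cronbach_alpha(data):.2f}")  # 0.80
```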

We’re proud to distribute TOEIC tests for ETS in the United States as well as in many countries in the Middle East and North Africa. Contact us to learn how you can offer TOEIC at your site.

 

Lia Nigro, TOEIC USA Team

Comments

re: test validity/reliability and reality

I am continually stunned by language testing 'in the field'. That is, for all the careful definitions of validity and reliability in academia, what passes for 'testing' (that I've seen) truly causes me to shudder. Be it college EFL reading comprehension courses or junior and senior high school classroom exams, the chaos and the lack of consistency, fairness, and just plain validity boggles the mind.

One example to make this concrete. If I teach in a program where students are 'tested' on texts, required to read a text and answer a series of 10-12 questions (some multiple guess format, some long-answer, some true-false variety, where you prove your position by "copying from the text"), who's to say that these exercises have real validity? Who's to say that these 10 questions truly test reading comprehension? Who's to say that tricky true-false questions are a test of English comprehension as opposed to complex thinking? Do English programs 'test' their tests? Do we ever even administer version B to the same students and see if there's close test-retest reliability? Rarely.

What allows classroom English teachers to say they have taught unit x, tested the students, and know whether or not the students know the material? My recent experience tells me that most classroom teachers haven't got a clue! If a student demands to "move up" from level B to level A in a hypothetical high school, and they have outstanding scores on their 'tests' in level B, by what logic do I say: "Well, I'll tell you what, I'll just give you the reading comprehension and writing tasks of the level A group's last exam (not the grammar; not the vocabulary), and if you can score an 80 on just those sub-tests, we'll move you up"? Since when is that a valid test?

Have we stopped teaching teachers anything about test theory? Do teachers understand reliability and validity? Is an analysis of literary themes the same as competence in English, or is it a measure of intelligence and thought processes independent of English competence? Thanks for letting me blow off steam... There is SUCH a mismatch between English language measurement theory (or test theory in general) and what passes for testing day to day in so many places. I am still reeling from what I see.
