Reliability


This post provides a deeper look into the idea of Equity of Opportunity, one of JumpRope’s Core Values.

Most of the time when we hear the term “validity” with respect to assessment, it’s coupled with the term “reliability”. If you’ve kept up with our current blog series on using rubrics for assessment, you know that the second post in the series discusses validity. I like to decouple these two terms to help people understand the difference between them. While validity is largely about the efficacy of an assessment, reliability is about replicability of results from one administration to the next. That said, an assessment cannot be deemed valid unless it is deemed reliable. And assessments have the best chance of being deemed reliable when we control, as best we can, for certain variables. Those include the suitability of the task for the particular students engaging with it, student readiness for assessment, the language used throughout the assessment, the consistency of assessment administration, and the time elapsed between administration and evaluation. As is the case with validity, there are a few different ways we can consider reliability.


Adjusting the variables of an assessment situation can help us explore that assessment’s reliability. In the case of test-retest reliability, if a teacher administers an assessment to one group of students and then to a second group of students, the results should be stable between groups. This definition of reliability pertains to an assessment that uses a rubric just as much as to any other assessment: the results of applying the rubric should be consistent from one set of students to the next, whether from period 1 to period 2 or from year 1 to year 2. Test-retest reliability is about varying the groups of students who engage with the assessment. Varying the assessment tool, or the probes within the tool, is referred to as parallel-forms reliability. We should check for parallel-forms reliability whenever we differentiate our assessment tools. Think about giving students a choice of creating an advertisement, writing an editorial, or participating in a debate, each as a demonstration of skills of persuasion. In reviewing the rubrics, if there is a clear correlation between the results of the three assessment methods, the assessment is considered reliable. Even better is the application of a single rubric to assess each of the three different products.
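For those who want to see what “stable” or “correlated” might look like in numbers, here is a minimal sketch using made-up rubric scores on a four-point scale; it is purely illustrative and not part of any JumpRope feature. It computes a simple Pearson correlation between the average score each rubric criterion earned in two different administrations; a value near 1.0 would suggest the rubric is behaving consistently from one group of students to the next.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    var_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (var_x * var_y)

# Hypothetical average rubric scores per criterion (1-4 scale)
# for the same assessment given to two different class periods.
period_1 = [3.1, 2.4, 3.6, 2.9, 3.3]   # criteria A-E, period 1
period_2 = [3.0, 2.6, 3.5, 2.8, 3.4]   # criteria A-E, period 2

r = pearson(period_1, period_2)
print(f"Correlation between administrations: {r:.2f}")
# A value near 1.0 suggests the rubric behaves consistently
# from one group of students to the next.
```

The same comparison could be run between two parallel forms of an assessment by pairing up the criteria the forms share.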


The final type of reliability I’ll discuss helps us maintain consistency from one evaluator to the next: inter-rater reliability. To work toward inter-rater reliability, we join with our colleagues, using the same assessment tools, to evaluate a collection of student work. When our evaluations of the same piece of work align, we know we have attained inter-rater reliability. We sometimes see the use of common assessments as the pinnacle of strong assessment practice. I would argue, though, that collaboration with colleagues in an effort to move toward a common understanding of what it means to meet, partially meet, or exceed expectations is part of the gold standard for ensuring assessment reliability. Using common assessments is one good step toward providing equity across classrooms, but it’s when our interpretation of those assessments aligns with our colleagues’ that we provide the greatest equity across all our students.
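If you want to put a number on that alignment, here is a small illustrative sketch, assuming two teachers have scored the same ten pieces of student work on a four-level rubric; the scores are invented for the example. It computes plain percent agreement along with Cohen’s kappa, a standard statistic that corrects agreement for what two raters would match on by chance.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of pieces of student work scored identically by both raters."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters assigned levels at their observed rates.
    p_e = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical rubric levels (1 = does not meet ... 4 = exceeds)
# assigned by two teachers to the same ten pieces of student work.
teacher_1 = [3, 2, 4, 3, 3, 2, 4, 1, 3, 2]
teacher_2 = [3, 2, 4, 3, 2, 2, 4, 1, 3, 3]

print(f"Percent agreement: {percent_agreement(teacher_1, teacher_2):.0%}")
print(f"Cohen's kappa:     {cohens_kappa(teacher_1, teacher_2):.2f}")
```

A kappa close to 1 suggests the teachers are interpreting the rubric the same way; a value near 0 means the agreement could just as easily be chance, which is a signal that more calibration time is needed.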


Working toward reliability within our assessments is one more way to step toward equitable practice. As we design and refine our rubrics and all the other materials we use to assess students’ assignments and support them through the assessment process, we need to be mindful of controlling the variables we actually can control. We should aim to design rubrics and other assessment tools that produce stable results from one group of students to the next, and that also yield stable results for assessed criteria even when the products students create vary. Finally, as we grow as assessors, establishing practices and setting aside time to move toward inter-rater reliability lets us rest assured that meeting expectations is not a question of who is doing the assessing, but rather a reflection of an agreed-upon common understanding of the learning goals.