Most schools in Texas are already conducting assessments and using data in various ways. The goal of this module is to help your team ensure you get the most out of these data to make important decisions about literacy instruction at your school.
In this lesson and throughout the Assessment module, you will be asked to reflect on and look critically at your assessment practices and to extend that discussion among your staff. This section outlines key elements of assessment to help you and your staff use the same language and vocabulary as you engage in this discussion.
Reliability
In the realm of assessment, reliability is about consistency: Will the measure consistently produce similar results under similar circumstances? Those similar circumstances might be taking the same test again, being assessed by another examiner, being assessed with another method, or even answering similar questions on the same test. The idea is that in all cases, if the results are consistent, we consider them to be reliable.
A simple example of an assessment measure that could give reliable results is a scale. If you weigh yourself each morning after your shower, you may get a reliable (consistent) result, staying more or less the same or showing small, expected variations based on your recent activities.
Reliability can be calculated, and it is typically reported as a coefficient expressed as a decimal. The closer the decimal is to 1.0, the more reliable the assessment is for its prescribed purpose. Most experts believe that .80 or higher indicates strong reliability. Note that most schools and districts do not have the resources to test locally developed assessments for reliability. If reliability has not been evaluated, you cannot assume that the results from such assessments alone are a reliable source of information about students' skills and progress.
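To make the idea of a reliability decimal more concrete, here is a minimal sketch of one common approach: test-retest reliability, estimated as the correlation between two administrations of the same test. This is only one of several ways reliability can be estimated, and it is not necessarily the method used for any particular assessment. The scores and the names pearson_r, first_attempt, and second_attempt are hypothetical and chosen for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two sets of scores vary together, and how each varies alone.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores: the same eight students take the same test twice.
first_attempt = [72, 85, 90, 64, 78, 95, 58, 81]
second_attempt = [75, 83, 92, 60, 80, 93, 62, 79]

reliability = pearson_r(first_attempt, second_attempt)
# The .80 benchmark comes from the text above.
meets_threshold = reliability >= 0.80

print(f"Test-retest reliability: {reliability:.2f}")
print(f"Meets the .80 benchmark: {meets_threshold}")
```

In this made-up data set, students who scored high the first time also scored high the second time, so the coefficient lands close to 1.0. A real reliability study would use far more students and carefully controlled testing conditions.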
Keep in mind, though, that reliability alone does not guarantee a true measure. Results can be consistently wrong, as well as consistently right. That bathroom scale just might be off a few pounds. For this reason, scales used in high-stakes endeavors, such as the sale of precious metals or weight-governed sports like wrestling, are carefully calibrated. Literacy assessment data also need to be more than just reliable; they also need to be valid.
Validity
Validity is the degree to which the results of an assessment actually reflect what the test and the interpreters of the test intend to measure. Looking at validity means ruling out any variables that are not related to what you are trying to measure.
Reliability is a component of validity. If your bathroom scale showed great differences in your weight from day to day, this inconsistency would indicate that there is something distorting the real results, some outside variable (other than your weight) that is influencing the number on the scale. Therefore, those results would not be valid.
Reliability alone is not enough to ensure validity, however. There may be variables that consistently distort the results of the assessment to the same degree or in the same manner. Imagine you are asked to re-take the professional portion of your certification exam, but this time in Russian or some other language you do not know. Even if you take the exam multiple times, you are likely to have low results. Not understanding the test itself presents a significant (and consistent!) variable unrelated to what the test says it is measuring. That consistency indicates high reliability, but this is not a valid measure of your knowledge of the field of education.
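Returning to the bathroom scale, the difference between "consistent" and "correct" can be sketched with a few numbers. The example below is an illustration only: the true weight and all of the readings are invented, and describe is a hypothetical helper. It treats the spread of repeated readings as a rough stand-in for reliability and the average error as a rough stand-in for one threat to validity.

```python
from statistics import mean, pstdev

# Hypothetical true weight, in pounds.
true_weight = 150.0

# Invented readings from three imaginary scales over five mornings.
readings_calibrated = [149.8, 150.1, 150.0, 150.2, 149.9]
readings_offset = [153.8, 154.1, 154.0, 154.2, 153.9]   # consistent, but about 4 lb high
readings_erratic = [143.0, 157.5, 148.0, 155.0, 146.5]  # jumps around from day to day


def describe(readings, true_value):
    """Return (spread, bias) for a list of repeated readings.

    spread: standard deviation of the readings (small = consistent)
    bias:   average reading minus the true value (near 0 = accurate on average)
    """
    spread = pstdev(readings)
    bias = mean(readings) - true_value
    return spread, bias


scales = {
    "calibrated": readings_calibrated,
    "offset": readings_offset,
    "erratic": readings_erratic,
}
for name, readings in scales.items():
    spread, bias = describe(readings, true_weight)
    print(f"{name:>10}: spread = {spread:.2f} lb, bias = {bias:+.2f} lb")
```

The offset scale is just as consistent as the calibrated one, yet every reading is wrong by about the same amount: reliable but not valid. The erratic scale averages out close to the true weight, but no single reading can be trusted, so its results are not valid either.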
An important point to keep in mind is that assessment data may be valid for one purpose but not for another. As stated above, validity is always relative to what the assessment is intended to measure. Throughout the Assessment component, the TSLP calls upon you to ask yourselves, "Are these data valid for the instructional decision we are considering?" The information in each lesson will help you answer that question for each type of assessment.
Formal and informal assessments
Formal assessments are those that are administered and scored using prescribed procedures and that have been examined for reliability and validity. Examples include state assessments such as the State of Texas Assessments of Academic Readiness (STAAR), end-of-course exams, and the Texas English Language Proficiency System (TELPAS) reading test. Commercially produced assessments that have been field tested for reliability and validity, and that are administered and interpreted in a standard way, would also be considered formal assessments.
Informal assessments are those that are not administered with standard procedures or that have not been examined for reliability and validity. These include teacher-created tests, including essay tests; informal reading inventories; observations of students at work; student presentations; and department and district tests that have not been rigorously field tested for reliability and validity.
Informal assessments are useful in many ways. The teacher discretion involved in informal assessments may allow for the assessment to be adapted or deepened in response to student performance. Whether in one-to-one conferences, observations of groups, or "ticket out" questions at the end of class, teachers frequently gather data informally to help them gauge student understanding and determine next steps in their classes. Informal assessments alone are not, however, appropriate as the main source of data for the instructional decisions called for in the response to intervention (RTI) model.
Norm-referenced versus criterion-referenced
These terms refer to the way that assessments are designed to be interpreted. One way is to compare a student's performance with the typical performance of students of the same age and grade level. That typical performance is the "norm" in "norm-referenced." The SAT, ACT, and Graduate Record Examination (GRE) are examples of norm-referenced assessments. This type of test is best at answering the question "How well is this student developing this knowledge or skill compared to his or her peers?"
Criterion-referenced tests compare a student's performance not to that of other students in a cohort or group, but to an expected level of performance that is established ahead of time. A test with a predetermined passing standard, like a teacher certification exam, is criterion-referenced. This type of test attempts to answer the question "How well did this student master this content or these skills?"
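The two interpretations can be sketched side by side using the same raw score. The example below is a simplified illustration with invented numbers: the norm group, the student's score, and the passing score are all hypothetical, and percentile_rank uses one simple definition (the percent of norm-group scores below the student's score) among several that test publishers may use.

```python
def percentile_rank(score, norm_group_scores):
    """Norm-referenced view: percent of norm-group scores below this score."""
    below = sum(1 for s in norm_group_scores if s < score)
    return 100.0 * below / len(norm_group_scores)


def meets_standard(score, passing_score):
    """Criterion-referenced view: did the score reach the preset standard?"""
    return score >= passing_score


# Hypothetical data for illustration only.
norm_group = [52, 58, 61, 65, 67, 70, 72, 75, 80, 88]  # scores of comparable peers
student_score = 68
passing_score = 70  # standard established ahead of time

rank = percentile_rank(student_score, norm_group)
passed = meets_standard(student_score, passing_score)

print(f"Norm-referenced: scored higher than {rank:.0f}% of the norm group")
print(f"Criterion-referenced: met the passing standard? {passed}")
```

Here the same score looks average when compared with peers but falls short of the predetermined standard, which is why it matters to know which kind of interpretation an assessment was designed to support.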
Formative versus summative
Like other terms here, the terms "formative" and "summative" are linked to the purpose of the assessment and how the results are used. When teachers gather information during a lesson or a course in order to measure interim learning and to make instructional adjustments accordingly, this is formative assessment.
Summative assessments are given at specific times to measure the achievement of students at that time. The information is used to report individual student performance and can also be used to evaluate how groups of students performed in a given course. Low performance in particular areas of a summative assessment (or overall) may spur educators to look into ways to adjust future instruction or curricula for a course, but this would not typically impact the students who were assessed. With formative assessment, by contrast, teachers are still teaching the content that was assessed, so they can adjust instruction in time to address the areas of need indicated by the data.
Formative assessments are often informal (without strict administration and interpretation procedures). Not all summative assessments are formal, however. Unit tests, midterms, and final exams that are typically developed at the classroom, grade, school, or district level may vary in the strictness of their administration and interpretation. What is more, these tests are rarely field tested to determine their reliability and validity and, therefore, should be understood as informal.