Reliability Versus Validity; Statistical Concepts
Measurement and Assessment
In psychometric testing, the concepts of reliability and validity describe the quality of the raw data derived from tests and how well those data relate to the real abilities of the people being tested. Psychologists, the members of the "helping professions" most concerned with measurement, use these two criteria to support the veracity of their results when testing for such things as achievement, aptitude, intelligence, personality, or symptomatology.
Reliability can be thought of as how consistently a test returns the same results (within margins of error) for one person tested repeatedly, or how well a test replicates a spectrum of results with respect to a normed group. Validity, on the other hand, pertains to how well a test measures what it purports to measure.
For instance, a watch that is 12 hours off is reliable, assuming it runs normally and has merely been mis-set, but never valid. An unset clock flashing twelve is never reliable but is valid for two sixty-second intervals a day. A clock that both runs normally and is set correctly is both reliable and valid. To extend this metaphor to another case, consider a target. In this case the bull's-eye is the person's true ability. A test that produces results for the same person with hits scattered all over and around the target is neither reliable nor valid. A test that produces a tight grouping of hits away from the bull's-eye is reliable but not valid, and a test that produces a tight grouping of hits right on the bull's-eye is both reliable and valid.
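The target metaphor can be put into numbers with a rough sketch. The scores below are invented, the true score of 100 is an assumption, and the function name `bias_and_spread` is hypothetical. The distance of the average hit from the bull's-eye stands in for (lack of) validity, and the tightness of the grouping stands in for reliability; real validity evidence is richer than a single bias figure.

```python
import statistics

def bias_and_spread(scores, true_score):
    """Return (bias, spread) for one person's repeated scores.

    bias   = mean score minus the true score (distance from the bull's-eye)
    spread = population standard deviation (tightness of the grouping)
    """
    bias = statistics.mean(scores) - true_score
    spread = statistics.pstdev(scores)
    return bias, spread

TRUE_SCORE = 100  # hypothetical true ability (the bull's-eye)

# Invented repeated scores for the same person on three imaginary tests.
patterns = {
    "scattered": [62, 131, 88, 119, 75],
    "tight, off target": [118, 120, 119, 121, 122],
    "tight, on target": [99, 100, 101, 100, 100],
}

for label, scores in patterns.items():
    bias, spread = bias_and_spread(scores, TRUE_SCORE)
    print(f"{label}: bias = {bias:+.1f}, spread = {spread:.1f}")
```

A small spread with a large bias corresponds to the mis-set watch: consistent, but consistently wrong.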
You may have noticed that reliability precedes validity, in that a test may be consistently reliable without being valid, but cannot be consistently valid without being reliable. Any kind of meaningful validity presupposes reliability.
Reliability in Depth
Reliability is expressed as a reliability coefficient, a statistical description of how closely two sets of measurements relate to one another. A high correlation coefficient indicates a high degree of reliability. These coefficients typically range from 0.00, indicating a complete lack of reliability, to 1.00, indicating perfect reliability.
Several methods are used to obtain measurements that should, in theory, have high reliability coefficients. The first, test-retest, compares people's performance on a test at time one to their performance on the same test at time two. Practice effects can be reduced by using alternate forms of the test (provided the alternate forms have already been evaluated for reliability between versions). The test-retest reliability coefficient is the correlation between the two sets of scores across a group of test-takers, not a ratio of one person's lower raw score to the higher one.
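Test-retest reliability is conventionally estimated as a correlation between two administrations across a group of test-takers. The following is a minimal sketch assuming a Pearson correlation; the scores are made up for illustration, and `pearson_r` is a helper written here rather than a function from any particular psychometrics package.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return cov / math.sqrt(var_x * var_y)

# Invented scores: six people tested at time one and again at time two.
time_one = [82, 90, 75, 68, 95, 88]
time_two = [80, 92, 78, 65, 96, 85]

r = pearson_r(time_one, time_two)
print(f"test-retest reliability coefficient: {r:.2f}")
```

Note that a single person's pair of scores cannot yield a correlation; the coefficient describes how well the ordering and spacing of a whole group's scores is preserved between administrations.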
A few methods are used to establish reliability within a single administration of a test. The simplest, and least used, is called split-half reliability. In this method, performance on, for instance, the odd-numbered questions is compared to performance on the even-numbered questions. If the two halves of the test produce consistent scores, then the test can be said to be internally consistent. In addition, the Kuder-Richardson Formulas 20 and 21 and Cronbach's alpha are more statistically complex means of computing internal consistency.
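As an illustration of these internal-consistency methods, here is a sketch under stated assumptions: the right/wrong responses are invented, the odd/even split is just one possible split, and the helper names are hypothetical. The Spearman-Brown step, which adjusts the half-test correlation up to full test length, is a standard companion to split-half reliability that the text above does not mention. For items scored 0 or 1, Cronbach's alpha computed this way gives the same value as Kuder-Richardson Formula 20.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(rows):
    """Cronbach's alpha; rows are people, columns are item scores."""
    k = len(rows[0])
    item_vars = [statistics.pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Invented data: 7 people, 6 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Split-half: questions 1, 3, 5 versus questions 2, 4, 6.
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]
split_half_r = pearson_r(odd_scores, even_scores)
corrected_r = spearman_brown(split_half_r)
alpha = cronbach_alpha(responses)

print(f"split-half r = {split_half_r:.2f}")
print(f"Spearman-Brown corrected = {corrected_r:.2f}")
print(f"Cronbach's alpha = {alpha:.2f}")
```

The split-half figure depends on which items land in which half, which is one reason alpha, which does not depend on a particular split, is more commonly reported.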
Validity in Depth
When describing validity, we must first differentiate between its types. Content validity, criterion-related validity, and construct validity have important distinctions.
Content validity refers to how well the questions on a test represent the domain being measured. For instance, a 10th-grade Biology Regents exam should have questions that measure what is taught in a 10th-grade biology classroom. This type of validity is most important in achievement and aptitude tests.
Criterion-related validity can be further broken down into two types: concurrent and predictive validity. For instance, a high score on an aggression scale of the Minnesota Multiphasic Personality Inventory (MMPI) should have some correlation with reports of aggression from a child's teachers and parents. This demonstrates concurrent validity: the test agrees with an outside criterion of aggression, as shown by actual behavior observed at about the same time. A test with high predictive criterion validity should render a score that can accurately forecast future performance. For instance, SAT scores have predictive criterion validity to the extent that they correlate with later college GPAs. Because the test is supposed to measure readiness for post-secondary education, this correlation supports the test's validity.
A last type of validity is construct validity. Constructs are theoretical and cannot be directly measured. A test that has construct validity should assess variables and behaviors that relate to a theoretical construct such as anxiety or intelligence. These constructs are typically derived from prior research, psychological theory, or observation. Construct validity can be established by correlating scores on a new test with scores on previous tests measuring either the same construct or a different construct. If a new test for depression correlates highly with the Beck Depression Inventory (BDI), this is called convergent validity. Conversely, if scores on a test of abstract reasoning show little correlation with scores on a test of a theoretically distinct construct, this is termed discriminant validity.
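The correlational checks used as construct validity evidence can be sketched as follows. All scores are invented, the variable names are hypothetical, and the "unrelated measure" is simply assumed to tap a construct that theory says should not track depression. In this sketch, a new depression scale is expected to correlate highly with an established depression measure (convergent evidence) and close to zero with the unrelated measure (discriminant evidence).

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented scores for the same eight people on three instruments.
new_depression_scale = [10, 14, 8, 20, 16, 5, 12, 18]
established_depression_measure = [12, 15, 9, 22, 17, 6, 11, 19]
unrelated_measure = [50, 48, 52, 49, 51, 50, 47, 53]

convergent_r = pearson_r(new_depression_scale, established_depression_measure)
discriminant_r = pearson_r(new_depression_scale, unrelated_measure)

print(f"convergent correlation:   {convergent_r:.2f}")
print(f"discriminant correlation: {discriminant_r:.2f}")
```

The same calculation applies to criterion-related validity: replace the second list with the outside criterion, such as later grades or observer ratings.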