Internal and External Validity
Extraneous Variables and Validity
Extraneous variables are variables that are not a planned part of a research
study; for example, a study of whether certain instructional methods are
associated with higher achievement. Extraneous variables may “interact with” independent (input, intervention) variables (instructional
methods) to produce an effect; e.g., teacher warmth may interact with how the
teacher demonstrates a math routine, and yield greater attention and acquisition
of skill. Or extraneous variables may produce
an effect all by themselves; e.g.,
home instruction might increase math achievement. Therefore, change (or lack of change) in dependent (outcome) variables (e.g.,
math achievement) may be entirely or partly the result of extraneous variables,
such as maturation, or other things happening inside and outside of school
(e.g., siblings teach some students to read) or measurement error (students appear to
read better because observers at the outcome assessment failed to count many
errors) or bias in selection (e.g., if the experimental group has many bright
students and the control group doesn’t, that difference---and not instruction---may
account for differences in achievement).
Findings and conclusions are not valid,
credible (believable), dependable, or good evidence
for making decisions if researchers can’t
rule out the strong possibility that OTHER factors account for findings.
Let’s say you are using official statistics to see which kind of reading program produces
the highest achievement in a school district.
You use end-of-grade test scores to divide the district into (1) schools
in which over 80% of students pass the tests; and (2) schools in which less
than 80% of students pass the tests. Then
you contact the schools and interview administrators and teachers to find out how their schools teach reading. But what if, in general, schools with the
lowest reading achievement also have
administrators who are so out of touch with school instruction that they really
can’t tell you how they teach reading?
Your data on how their schools teach reading would not be accurate; your data would be invalid. In this case, the
inaccuracy of the “observer” (the administrator) is an extraneous variable that muddies the picture. It’s called “extraneous” because it’s not part of the process you are
investigating.
The Relationship Being Studied

Reading instruction → Measurement (and therefore findings) on Achievement

Possible Extraneous Variables that Influence Findings on the Possible Relationship

Maturation
Home instruction
Teacher communicates high expectations
Measurement error
Biased sample
Experimental mortality (poor readers drop out)
Or, let’s say you are conducting an experiment to see if peer tutoring
will increase retention of math skills.
Some students participate in peer tutoring (for a month) and others do
not. You compare math skills before,
right at the end, and every week for five weeks after the month of peer tutoring.
Sure enough, students who participated in peer tutoring do maintain math skills more than
students who did not receive peer tutoring.
But what if more of the students who received peer tutoring also had siblings who worked with them on math at
home, and what if this IN PART affected their retention of math skills? Help at home would also be an extraneous variable that muddies the
picture, and makes your findings (“Peer tutoring seems to be effective.”)
invalid. In summary, many other things besides
the hypothesized independent variables can account for findings. An important part of research is trying to rule out the possibility that OTHER factors
(extraneous variables) account for findings.
Internal and External Validity
"Internal
validity" refers to how accurately the data and the conclusions drawn from the data (e.g.,
Change in X causes change in Y) represent what really happened. For example, looking at pre-test and post-test
scores, it may seem that a training program increased teachers' skills.
However, some of the difference between pre- and post-test scores may be the
result of measurement error; during
the post-test, observers wrongly scored some sloppy teaching as “proficient.”
"External validity" refers to how
accurately the data and your conclusions drawn from the data (e.g., Change in X
causes change in Y) represent what goes on in the larger population. For instance, if a sample of teacher-trainees
is biased in some way (e.g., the sample contains a higher proportion of
motivated trainees than is found in the general population of potential
teacher-trainees), then findings from the sample may not apply to (won’t
describe) the general population.
Note that findings
and inferences may have internal validity but not external validity. That is, findings and conclusions may
accurately represent what was found in the sample studied, but may not apply to
other samples. However, if findings and conclusions do not
have internal validity, then they
surely don’t have external validity either.
The
factors that can weaken internal and external validity are called "extraneous variables.” Maturation of study participants is an
example. Change in children's skills during instruction may reflect maturation
of the nervous system and muscles as well as the effects of instruction. So, if
the research hypothesis is that instruction will increase children's skills,
the "rival hypothesis" is that maturation will increase
children's skills. That’s why it is important to identify possible extraneous
variables (sources of "contamination"). You can then design research
to weaken or eliminate the effects of these variables, or you can analyze the
data to determine what effect the extraneous variables have had. For example,
if you use an experimental and control group, and if you created the two groups
using the method of random allocation, then you weaken the rival
hypothesis of maturation, since children in both groups have an equal chance of improving as a result
of maturation.
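Random allocation can be sketched in a few lines. This is a minimal illustration, not a full research protocol; the function name randomly_allocate and the student labels are hypothetical, and a fixed seed is used only so the split can be reproduced.

```python
import random


def randomly_allocate(participants, seed=None):
    """Shuffle participants and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical roster of 20 students.
students = [f"student_{i}" for i in range(1, 21)]
experimental, control = randomly_allocate(students, seed=42)
print(len(experimental), len(control))
```

Because chance alone decides who lands in which group, fast-maturing children (or children with any other characteristic) are about as likely to be in one group as in the other.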
Extraneous Variables
That Are Threats to Internal Validity
1. Instruments
do not measure what they purport to measure.
In other words, the findings are not valid. For instance,
a. The dependent (outcome) variable is reading
proficiency. However, that is NOT what
the researcher is measuring. Instead, the
researcher is measuring behavior such as turning pages, naming parts of a book,
holding books properly, memorizing words, and guessing what words say. If the researcher is
“testing” a new method of reading instruction (a method that does NOT work),
this method will APPEAR to be effective because
the researcher is not measuring reading at all. Consumers should expect researchers to use standardized validated methods and
instruments, or expect researchers to carefully define variables, and then
develop valid measures based on these definitions.
b. The
measurement method or instrument has not
been tested for reliability; that
is, different observers or testers observing the same thing would NOT get the same scores.
If a method or instrument is NOT of known high reliability, then a group receiving an intervention may appear to have made a lot of
progress between pre-test and post-test,
but only because the post-test scores were wrong.
Therefore, consumers should expect
researchers to use methods and instruments
with known high reliability, and should expect researchers to check that observers and testers produce
reliable data before a study begins, and periodically
during a study if repeated measurement is used.
c. Data
that should be OBJECTIVE (e.g., counting
how often teachers properly correct student errors) are in fact subjective---opinions, impressions. These subjective data (“I learned a
lot!” “Training was excellent.” “I am confident that I can properly teach the
five reading skills.”) can’t be used to determine if a
program or method is effective or if teachers are proficient. Why?
For the same reason that you can’t use subjective opinions to determine if
a medication is effective. If a drug
works, there will be objective changes in the body. If a program works,
there will be objective changes in student
behavior. Opinions do not measure
proficiency; they measure feelings.
Also, opinions and feelings and impressions change—and therefore are not
reliable indicators of hard facts of proficiency.
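The reliability check described in item 1b (do different observers scoring the same thing get the same scores?) can be sketched as follows. This is one possible approach, assuming two observers each code the same ten student responses as "correct" or "error"; the data are made up for illustration. Percent agreement is the simplest index, and Cohen's kappa adjusts it for agreement expected by chance.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Proportion of items on which two observers gave the same code."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two observers, corrected for chance agreement."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both observers used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two observers watching the same ten responses.
observer_1 = ["correct", "error", "correct", "correct", "error",
              "correct", "error", "correct", "correct", "correct"]
observer_2 = ["correct", "error", "correct", "correct", "correct",
              "correct", "error", "correct", "correct", "error"]

agreement = percent_agreement(observer_1, observer_2)
kappa = cohens_kappa(observer_1, observer_2)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

What counts as "high" reliability is a judgment the researcher should state in advance; the sketch only computes the indices.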
2. History.
History includes events in
addition to the independent variables under study, that occur between one measurement and another
(e.g., between a pre-test and post-test). For example, in testing the effects of an
exercise program on psychological well-being following heart attack, some
participants joined a church, or received additional social support, or changed
jobs. These extraneous (history) variables may account for some of the
differences between pre- and post-test scores.
To
weaken history as a rival hypothesis, researchers should use equivalent experimental and control groups
(created by random allocation or matching).
Since the groups are, logically, likely to have the same historical variables happening between pre- and post-test, differences in the outcomes
are NOT likely to be the result of history.
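The logic of using a control group to weaken history can be shown with simple arithmetic. In this sketch, which uses made-up scores, the control group's gain stands in for whatever history (and other shared influences) produced, and subtracting it from the experimental group's gain leaves the part that history does not explain. The function name net_gain is hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def net_gain(exp_pre, exp_post, ctl_pre, ctl_post):
    """Experimental group's mean gain minus the control group's mean gain."""
    exp_gain = mean(exp_post) - mean(exp_pre)
    ctl_gain = mean(ctl_post) - mean(ctl_pre)
    return exp_gain - ctl_gain


# Hypothetical pre-test and post-test scores.
exp_pre, exp_post = [40, 45, 50, 55], [55, 60, 65, 70]  # mean gain of 15
ctl_pre, ctl_post = [42, 44, 52, 54], [47, 49, 57, 59]  # mean gain of 5

effect = net_gain(exp_pre, exp_post, ctl_pre, ctl_post)
print(effect)  # 10.0
```

This only works if the groups really are equivalent and really did experience the same events between the two tests.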
3. Maturation. Maturation refers to changes that
ordinarily occur with time (e.g., strength,
increasing knowledge). For instance,
let’s say a new method to increase children’s attention span is tested in an
experimental intervention. And let’s say
that most children are more attentive two months later, during the
post-test. The experimenter may think
this improvement is the result of the method, but the rival hypothesis is
simply that the children became more mature, and THAT increased their
attention. To handle the extraneous
variable of maturation, researchers use equivalent comparison groups, or use
experimental designs in which the experimental group serves as its own control
(e.g., the equivalent time samples design).
4. Testing. This refers to the effects of
taking one test on the results of a later test. For instance, improvement in
scores might reflect decreasing fear of being tested, or figuring out what
kinds of answers are correct.
Testing can be controlled
in part by using different versions of the same tests and by using comparison
groups in which one group does not receive a pre-test.
5. Statistical regression. A person's performance
of any task varies within a certain range. On the average, you may be able to do 10
pull-ups, but on a particular day you may do 8, 9, 11, or 12. In fact, there
may be days when your performance is quite
unusual--you can barely do 5 pull-ups, or somehow you manage to do 18.
However, if you did pull-ups the next day, and the day after that, your
performance would probably regress (move)
to the mean, or your average performance.
In research, a
group's pre-test performance might (by chance) be unusually high or low; some people had a good day or a bad day. On
later testing, the group's performance regresses to the mean (i.e., is more
usual). The researcher may mistakenly treat differences between pre- and
post-test scores as the result of an intervention ("They improved.")
or as the failure of an intervention ("They got worse!"), when in
fact, the group merely turned in its average or usual performance.
The rival hypothesis
of statistical regression can be partly controlled by using equivalent comparison
groups, since the possibility of unusual scores applies equally to the groups.
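Regression to the mean can be demonstrated with a small simulation. The sketch below rests on stated assumptions: each person has a stable "true" ability, each test score is that ability plus random day-to-day noise, and nobody receives any intervention. All numbers (means, spreads, group sizes) are arbitrary choices for illustration. The people with the lowest pre-test scores are picked out and simply tested again.

```python
import random


def simulate_regression_to_mean(n_people=1000, seed=0):
    """Return (pre_mean, post_mean) for the lowest 10% of pre-test scorers."""
    rng = random.Random(seed)
    people = []
    for _ in range(n_people):
        ability = rng.gauss(50, 10)        # stable true performance
        pre = ability + rng.gauss(0, 10)   # good-day / bad-day noise
        post = ability + rng.gauss(0, 10)  # fresh noise, no intervention
        people.append((pre, post))
    people.sort(key=lambda scores: scores[0])
    lowest = people[: n_people // 10]
    pre_mean = sum(p[0] for p in lowest) / len(lowest)
    post_mean = sum(p[1] for p in lowest) / len(lowest)
    return pre_mean, post_mean


pre_mean, post_mean = simulate_regression_to_mean()
print(f"lowest pre-test scorers: pre={pre_mean:.1f}, post={post_mean:.1f}")
```

The selected group scores higher on the second test even though nothing was done to them, which is exactly the kind of "improvement" a researcher might wrongly credit to an intervention.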
6. Selection bias. In research using comparison
groups, some participants in one
group may be different
from those in the other group(s) in ways that affect performance. For instance,
an experimental group may do much better on a post-test than the control group,
not because the experimental intervention was effective but because more of
the experimental group members figured out how to take the test (see number 4,
Testing, above). Similarly, the pre-test/post-test differences between the experimental
and control group may be small, suggesting that the intervention did not work. However, in fact, the
control group contained many people who WERE
likely to change as a result of maturation or some historical factor, and so
they got high scores even though they received no
intervention.
This source of
invalidity can be handled, in part, by random allocation of participants to
comparison groups. This way, all
possibly biasing factors have an equal chance of being in both groups.
7. Experimental mortality. This refers to
the differential loss of participants from comparison groups. For example, an
experimental intervention may appear to work only because participants with
whom it was not going to work dropped out. Similarly, an
intervention may appear to work no better than nothing at all, only because
people in the control group who would have gotten WORSE over time dropped out, leaving people in the control group
who improved. Thus, the control group scores about the same as the experimental
group.
The rival hypothesis
of experimental mortality can be partly controlled by using equivalent comparison
groups, since the chances of dropping out should be about equal in the two groups.
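One simple check is to compare drop-out rates across the groups. The sketch below uses made-up enrollment figures, and the 10-percentage-point warning threshold is an arbitrary assumption for illustration, not an established standard.

```python
def attrition_rate(enrolled, completed):
    """Proportion of enrolled participants who did not complete the study."""
    if enrolled <= 0 or not 0 <= completed <= enrolled:
        raise ValueError("need 0 <= completed <= enrolled and enrolled > 0")
    return (enrolled - completed) / enrolled


# Hypothetical counts.
experimental_rate = attrition_rate(enrolled=50, completed=35)  # 30% dropped out
control_rate = attrition_rate(enrolled=50, completed=45)       # 10% dropped out

differential = abs(experimental_rate - control_rate)
WARNING_THRESHOLD = 0.10  # assumed cut-off, chosen only for this example
needs_scrutiny = differential > WARNING_THRESHOLD
print(f"differential attrition = {differential:.2f}, scrutinize = {needs_scrutiny}")
```

Equal rates are not proof of safety: the researcher should still ask whether the people who left each group differ from those who stayed.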
9. Causal time order. Here,
participants began to change prior to an intervention, but the
researcher does not know this. It
only appears that the intervention is the cause of the change.
[Figure: scores over time. Horizontal axis labels: Pre-Test, Intervention.]
This is what is really happening.
[Figure: scores over time. Horizontal axis labels: Repeated Pre-tests, Intervention.]
Notice that change during the intervention is just a
continuation of what had already begun.
A partial solution
is an extended series of repeated baseline or pre-intervention observations, to
assess the stability of performance before an intervention. If the “baseline” or pre-test scores are
stable (mostly a straight line), and scores ONLY rise AFTER the intervention
begins, you have evidence that the intervention is having an effect. For example
[Figure: scores over time. Horizontal axis labels: Repeated Pre-tests [Stable], Intervention.]
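Whether a baseline is "mostly a straight line" can be checked roughly by fitting a trend line to the repeated pre-test scores. The sketch below computes an ordinary least-squares slope, assuming the observations are equally spaced in time; the scores and the tolerance of 0.5 points per observation are made-up values for illustration.

```python
def least_squares_slope(scores):
    """Slope of the best-fitting straight line through equally spaced scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(scores))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def is_stable(scores, tolerance=0.5):
    """Treat a baseline as stable if its trend is close to flat (assumed cut-off)."""
    return abs(least_squares_slope(scores)) <= tolerance


stable_baseline = [10, 11, 10, 10, 11, 10]  # hypothetical: no trend
rising_baseline = [10, 12, 14, 16, 18, 20]  # hypothetical: already improving

print(is_stable(stable_baseline), is_stable(rising_baseline))
```

A rising baseline warns that change had begun before the intervention, so later gains cannot be credited to the intervention alone.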
10. Diffusion or imitation. Here, part of
an intervention given to an experimental group is used by members of the control group. Thus, the intervention
does not appear to make much of a difference, because both groups have changed.
For example, families in a training program lend materials to friends in the
control group.
One way to try to
control this is to make sure that members of the comparison groups do not know
one another. Another method is not to tell participants what group they are in---a
single blind study. However, this
may pose ethical problems. Still another method is to use delayed-intervention
control groups (so that members of the control group may be more willing to
wait).
11. Compensatory rivalry. Knowing they
are in a control group, some participants try to change on their own.
Improvement in the control group may be mistaken to mean that the intervention
is no better than no intervention. One
way to handle this is NOT to tell participants which group they are in. This is called a “single blind” study.
12. Demoralization. Knowing they are in a control group, and not
receiving an intervention that they want, some members of the control group
look worse over time than they otherwise would. This may result in differences
between the experimental and control group being mistaken for the effects of
the intervention. (Imagine the effects on their life expectancy if people with
AIDS knew that they were in the control group of a drug experiment.)
A partial solution
is to use a delayed-treatment design (rather than no-treatment design). Also,
one could use alternative treatment groups rather than a control group.
Extraneous Variables That Are Threats to External
Validity
Keep in mind that all of the threats to internal
validity are also threats to external validity. Additional threats to external validity
include the following.
13. Reactive or interactive effects of testing.
"Reactive" effects of testing means that a pre-test alone
influences post-test performance. "Interactive"
effects of testing means that a pre-test influences how people are affected by
an intervention. If the
performance of an experimental group after an intervention has been influenced
by the pre-test, the findings (e.g., amount of beneficial change resulting from
treatment) may not apply to the general population which is not likely to receive a pre-test.
Therefore, it may be
important to assess the effects of pre-testing itself. An experimental design called the Solomon
Four-group Design is an effort to control this source of invalidity.
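The Solomon Four-group Design uses four randomly allocated groups: (1) pre-test, intervention, post-test; (2) pre-test, no intervention, post-test; (3) no pre-test, intervention, post-test; (4) no pre-test, no intervention, post-test. The sketch below shows how the four post-test means can be compared. The means are made-up numbers, and the function name solomon_effects is hypothetical; a real analysis would also test whether the differences are statistically reliable.

```python
# Hypothetical mean post-test scores for the four groups.
post_means = {
    ("pretest", "intervention"): 78.0,
    ("pretest", "control"): 62.0,
    ("no_pretest", "intervention"): 72.0,
    ("no_pretest", "control"): 60.0,
}


def solomon_effects(means):
    """Compare post-test means to separate intervention and pre-test effects."""
    effect_with_pretest = (
        means[("pretest", "intervention")] - means[("pretest", "control")]
    )
    effect_without_pretest = (
        means[("no_pretest", "intervention")] - means[("no_pretest", "control")]
    )
    return {
        "intervention_effect_with_pretest": effect_with_pretest,
        "intervention_effect_without_pretest": effect_without_pretest,
        # Reactive effect: what the pre-test alone does to post-test scores.
        "reactive_effect": means[("pretest", "control")]
        - means[("no_pretest", "control")],
        # Interactive effect: how much the pre-test changes the intervention's effect.
        "interactive_effect": effect_with_pretest - effect_without_pretest,
    }


effects = solomon_effects(post_means)
for name, value in effects.items():
    print(f"{name}: {value}")
```

In this made-up example, the intervention looks stronger when a pre-test is given (16 points versus 12), so the pre-tested result would overstate what the general population, which gets no pre-test, can expect.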
14. Interaction of selection bias and X (intervention). Here, a bias in the selection of the
experimental group results in enough members of the experimental group being
especially likely to be affected (or not affected) by X, so that the experimental
group’s post-test scores are higher than scores of the control group. But since samples in the general population
are NOT likely to have this bias, the results of the intervention with other
samples may be less than in the experiment.
One way to handle
selection bias is to use random sampling
so that study samples are equivalent to the general population.
15. Interactive effects of experimental arrangements. If the performance of people in an
experimental group was affected (positively or negatively) by certain features
of the experiment, or by the fact that it was seen by them as an experiment,
findings from the experimental group may not apply to samples from the general
population who will receive the intervention in a nonexperimental setting. For instance, teachers in an
experimental training program (which gives them a sense of being special) may
change more than later trainees who simply receive a course on the same
material. There is no way of getting around
this one. The more you control a
situation so that you get valid data, the LESS the situation is like real life,
and therefore, the results you got in the contrived setting may not happen
outside of it. However, you CAN TEST
THAT very hypothesis by replicating the research in more and more natural
settings and seeing if the results remain about the same.