Psychological Bulletin: Transfer of Test-Enhanced Learning: Meta-Analytic Review and Synthesis
Manuscript version
Funded by:
• American Psychological Association
• National Science Foundation
© 2018, American Psychological Association. This manuscript is not the copy of record and may not exactly
replicate the final, authoritative version of the article. Please do not copy or cite without authors’ permission.
The final version of record is available via its DOI: https://dx.doi.org/10.1037/bul0000151
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
Running head: TESTING AND TRANSFER
Abstract
Attempting recall of information from memory, as occurs when taking a practice test, is one of
the most potent training techniques known to learning science. However, does testing yield
learning that transfers to different contexts? In the present article, we report the findings of the
first comprehensive meta-analytic review into that question. Our review encompassed 192
transfer effect sizes extracted from 122 experiments and 67 published and unpublished articles. The overall meta-analysis indicated that
testing can yield transferrable learning as measured relative to a non-testing reexposure control
condition (d = 0.40, 95% CI [0.31, 0.50]). That transfer of learning is greatest across test
formats, to application and inference questions, to problems involving medical diagnoses, and to
mediator and related word cues; it is weakest to rearranged stimulus-response items, to untested
materials seen during initial study, and to problems involving worked examples. Moderator
analyses further indicated that response congruency and elaborated retrieval practice, as well as
initial test performance, strongly influence the likelihood of positive transfer. In two
assessments for publication bias (using PET-PEESE and various selection methods), the
moderator effect sizes were minimally affected. However, the intercept predictions were
substantially reduced, often indicating no positive transfer when none of the aforementioned
moderators are present. Overall, our results motivate a three-factor framework for transfer of
test-enhanced learning and have practical implications for the effective use of practice testing in educational and other contexts.
Keywords: meta-analysis
Public Significance Statement
The present meta-analysis found that practice testing can result in learning that
generalizes to different situations and different test types. That transfer of learning is greatest
across test formats, to application and inference questions, to problems involving medical
diagnoses, and to tests involving mediator or related word cues. It is weakest to rearranged cues
and responses, to unpracticed information that was seen during prior study, and to problems involving worked examples.
The act of attempting to recall information from memory, as occurs when taking a test,
provides not only an assessment of prior learning but also a potent new learning opportunity.
That finding is the chief result of more than 200 studies from over a century of research
(beginning with Abbott, 1909), in confirmation of earlier anecdotal observations (e.g., James,
1890). Studies showing the benefit of testing for memory – more formally known as
test-enhanced learning, the testing effect, or the retrieval practice effect – commonly utilize a
three-phase experimental paradigm that begins with (a) initial study of a set of to-be-learned
materials (e.g., word lists or text passages), followed by (b) training on those materials via
testing or, for comparison purposes, a non-testing reexposure control condition (e.g., restudy),
and ending with (c) a final test. On that final test, materials that were initially tested are usually
better remembered than those that were not. Test-enhanced learning has been demonstrated
across a wide range of materials (for a listing, see Rawson & Dunlosky, 2011; for reviews, see
Bjork, 1975; Dempster, 1996; Rickard & Pan, 2017; Roediger & Butler, 2011; Roediger &
Karpicke, 2006; Roediger, Putnam, & Smith, 2011; for meta-analyses, see Adesope, Trevisan,
& Sundararajan, 2017; Rowland, 2014), with a variety of test types (e.g., McDaniel, Wildman &
Anderson, 2012; Pan, Gopal, & Rickard, 2015), with and without correct answer feedback (i.e.,
being shown the correct answer) after attempting retrieval (e.g., McDaniel, Bugg, Liu, & Brick,
2015; Rowland & DeLosh, 2015b), across a variety of retention intervals (e.g., Carpenter,
Pashler, Wixted, & Vul, 2008; McDaniel, Howard, & Einstein, 2009), and with individuals of
diverse ages (e.g., Carpenter et al., 2016; Jones et al., 2015; Meyer & Logan, 2013) and different
memory abilities (e.g., Agarwal, Finley, Rose, & Roediger, 2016; Pan, Pashler, Potter, &
Rickard, 2015).
Given the strong evidence for its memorial benefits, many cognitive and educational
psychologists now classify testing as among the most effective educational techniques
discovered to date. These researchers emphasize that tests are beneficial not just for assessment,
but also as powerful learning tools in and of themselves (i.e., in the form of practice or no-stakes
tests; for discussions see Benjamin & Pashler, 2015; Bourne & Healy, 2013; Brown, Roediger, &
McDaniel, 2014; Fiorella & Mayer, 2015; Karpicke, 2012; McDaniel, Roediger, & McDermott,
2007; Pashler, Rohrer, Cepeda, & Carpenter, 2007; Rawson & Dunlosky, 2012; Roediger & Pyc,
2012). Testing is included among recommended training methods from the U.S. National Center for Education and the National Research Council
(Druckman & Bjork, 1994; Pashler, Bain, et al., 2007), is highlighted in a recent comprehensive
review of effective learning techniques from cognitive and educational psychology research
(Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013), and has begun to attract attention
from the mainstream media (e.g., Carey, 2013; Lahey, 2014; Paul, 2016).
Most studies of test-enhanced learning feature identical materials (i.e., test questions)
during both initial and final tests. Although important theoretically and in some educational
contexts (e.g., the learning of reactions in multiple STEM fields), in numerous situations more flexibly applicable learning is
needed. For instance, instructors frequently eschew revealing exam questions during classroom
lessons, practice quizzes, and other training activities (e.g., Balch, 1998; Popham, 2011;
Wooldridge, Bugg, McDaniel, & Liu, 2014; Mayer, 2009). Moreover, in ecologically valid
circumstances one cannot expect to always recall the same information or to correctly
anticipate the manner in which previously learned information will need to be used. Instead,
it is often impractical to directly train on all to-be-learned materials; as such, it would be desirable to know if and when the benefits of
testing on a subset of materials can yield benefits for other, not directly tested materials.
The aforementioned scenarios raise the issue of transfer of learning (i.e., the use of prior
learning in a new context or contexts; for a more detailed definition and specific examples, see
the next section). Transfer of learning is commonly described as a paramount goal or even the
“holy grail” of education (Druckman & Bjork, 1994; Haskell, 2001; McDaniel, 2007). The
critical question arises: beyond aiding retention, does testing enhance the transfer of learning to new contexts?
Two reviews of the test-enhanced learning literature, Roediger and Butler (2011) and
Roediger, Putnam, et al. (2011), included subsections on transfer; in both articles, the conclusion
(based on the limited evidence available) was that testing does indeed generate transferrable
learning. Roediger, Putnam et al. included transfer among their “ten benefits of testing”, of
which two were “testing improves transfer of knowledge to new contexts” and “testing can
facilitate retrieval of material that was not tested” (pp. 14-20). Similar statements can be found
in recent empirical articles on test-enhanced learning, including Butler (2010), Rohrer, Taylor,
and Sholar (2010), and Carpenter and Kelly (2012), as well as in articles on the technique written
for the general public (e.g., Lahey, 2014; Paul, 2016; Swaminathan, 2006).
Carpenter (2012), in a brief review that was the first and, prior to this writing, only paper
to specifically focus on this topic, highlighted over two dozen studies and concluded that testing
can yield transferrable learning, but noted that further research is needed to gain a more
comprehensive understanding of that transfer. Since that review, the literature on transfer of test-
enhanced learning has grown substantially, now exceeding 70 studies. It contains a diverse set
of experiments that vary in terms of transfer contexts (e.g., involving the same vs. different cues;
contexts to be further detailed later in this article), types of initial tests (e.g., free vs. cued recall),
and other potentially critical experimental design features (e.g., brief or long retention intervals).
In light of that growth, it is broadly agreed in the field that a new, comprehensive review
is needed. In the present article, we address that need through meta-analysis of 192 effect sizes
from 122 experiments and 67 articles in which transfer was measured relative to a non-testing
reexposure control condition. That analysis provides, for the first time at the level of the
literature, statistically based insight into the conditions under which transfer occurs and the factors that moderate it.
Drawing on prior literature (e.g., Carpenter, 2012; Gick & Holyoak, 1987; Haskell, 2001;
McGeoch, 1942; Roediger, 2007), the definition of transfer used throughout this review is the
productive use of prior learning in a novel context. What exactly constitutes a “novel context”?
In transfer research, a novel context can potentially refer to any situation that is different in some
way from that in which original learning took place (McDaniel, 2007). This may include a
different topic, a different goal, a different test type, or any number of other contextual changes
(for a taxonomy, see Barnett & Ceci, 2002). For example, if information that is trained via a free
recall test is later assessed on a final multiple-choice test, then that final test constitutes a novel
context (i.e., transfer across test formats). Alternatively, if prior learning needs to be integrated
with new information on a final application test, then that application test constitutes a novel
context (i.e., transfer to application questions). In another example, if learners are trained to
recall words given specific cues (e.g., given mother, recall child), and then have to recall those
words in response to different cues on a final test (e.g., given father, recall child), then that final
test also constitutes a novel context (i.e., transfer to mediator word cues). The list of possible novel contexts is extensive.
Some contextual changes are more extensive than others. For instance, a change in test
format is typically regarded as less substantial than the combination of a change in subject matter
and a switch to application questions. In the transfer literature, the range of possible novel
contexts is often dichotomized into near transfer (i.e., relatively minor) and far transfer (i.e.,
extensive or multiple changes) categories (Barnett & Ceci, 2002; Perkins & Salomon, 1994).
Some transfer researchers argue that relatively minor contextual changes (i.e., “near” transfer)
constitute “ordinary learning” and should not be considered as involving transfer (Perkins &
Salomon, 1994), although there is no absolute dividing line between ordinary learning and
transfer. Drawing on that precedent, in this review we did not consider studies in which the
contextual change was solely the passage of time or a change in physical location as involving
transfer.1 For the current purposes, such changes were too minor to constitute meaningful
transfer (i.e., they represent ordinary learning). Overall, our review encompassed a wide range
of educationally, practically, and theoretically meaningful transfer contexts – including six major
transfer categories that span from “near” to “far” transfer (namely transfer across test formats, to
rearranged stimulus-response items, to untested materials seen during initial study, to application
and inference questions, of problem-solving skills, and to mediator and related word cues; each
is defined in subsequent sections of this review) – that comprise the vast majority of the literature to date.
1 For studies involving transfer of test-enhanced learning, the retention interval between the training phase and final
test is typically equivalent across the following categories: (a) items that were not tested (e.g., the restudied items)
during training and only tested on the final test, (b) items that were tested during training and tested in an identical
way on the final test (yielding the testing effect as defined in this review), and (c) items that were tested during
training and then tested in a different context on the final test (yielding the transfer effect). Hence, the effect of
retention interval on final test performance should be similar for the non-testing reexposure control and transfer
conditions on the final test, the two conditions through which the transfer effect is measured.
We next define the terms used throughout this review to describe the three-phase paradigm.
First, after (a) an initial study phase on a set of to-be-learned materials, which we will refer to as
initial study, those materials are (b) practiced in a training phase via testing or a non-testing
method. We will use initial test to describe training through testing, and the non-testing method
will be described generally as the non-testing reexposure control (when discussing individual
studies, we will refer to the non-testing reexposure control by the task that is used, such as
restudy or rereading). Finally, after a common retention interval, prior learning is assessed via
(c) a final test (i.e., criterial test). That final test allows comparison of learning and retention that
occurred via testing vs. the non-testing reexposure control condition. In some cases, the final test
includes both transfer and non-transfer questions; when discussing such cases, we will
differentiate final test questions or tests that directly assess transfer by using the term transfer
questions (or transfer test). We also distinguish between the effects of testing where transfer is and is not involved (i.e., testing’s
effects on transfer vs. verbatim retention). We investigated the former case (i.e., transfer of test-
enhanced learning). It is useful to have shorthand terms for both types of effects (test-enhanced learning vs. transfer of test-enhanced learning). For brevity,
the term testing effect will be used to refer to the case of identical contexts, retrieval cues, and
required responses on initial and final tests (which could also be described as “conventional test-
enhanced learning” or a “retention effect”),2 and the term transfer effect will apply to the case of
differences in either cues or required responses (or both) on the initial and final tests (i.e., the “transfer of test-enhanced
2 Both testing and transfer effects can be assumed to be a result of testing’s effects on either learning, memory, or
both. Thus, “test-enhanced learning” and “transfer of test-enhanced learning”, respectively, would be perhaps the
most accurate descriptors (and would have been used in this review if not for their length).
learning”). In this review, both testing and transfer effects are measured relative to final test performance in the non-testing reexposure control condition. A positive transfer effect (or, as
shorthand, simply a transfer effect) will refer to final test performance that is superior to that in
the control condition, and a negative transfer effect will refer to the opposite case (McGeoch,
1942; Haskell, 2001). The use of “transfer” as a verb can be assumed to refer to a statistically positive transfer effect.
A comprehensive discussion of all theories and research perspectives from the test-
enhanced learning and broader transfer literatures is beyond the scope of this review (for
discussions of the former, see Delaney, Verkoeijen, & Spirgel, 2010; Karpicke, Lehman, & Aue,
2014; Roediger & Butler, 2011; Roediger & Karpicke, 2006; van den Broek et al., 2016; for
coverage of the latter, see Cormier & Hagman, 1987; Ellis, 1965; Haskell, 2001; McGeoch,
1942; Mestre, 2005; Singley & Anderson, 1989). However, several theoretical perspectives
merit brief discussion. Although many accounts of test-enhanced learning do not directly address transfer (e.g., Halamish & Bjork,
2011; Kornell, Bjork, & Garcia, 2011; Mozer, Howe, & Pashler, 2004), the following three
theories suggest a process mechanism that incorporates it. First, the elaborative retrieval
hypothesis (Carpenter & DeLosh, 2006; see also Carpenter, 2009) posits that a process of
spreading activation occurs during the search for correct answers on tests (cf. ACT-R, Adaptive
Control of Thought-Rational and SAM, Search of Associative Memory; Anderson, 1996; Collins
& Loftus, 1975; Raaijmakers & Shiffrin, 1981); as a result, multiple retrieval routes are created
which aid later recall, resulting in the testing effect. Transfer effects may also result from the
same mechanism: when information that is semantically related to previously tested information
needs to be recalled on a transfer test, the process of spreading activation that presumably
occurred during initial testing increases the likelihood that such information will be recallable as
well (Carpenter, 2011; Chan, 2009; Chan, McDermott, & Roediger, 2006; Cranney, Ahn,
McKinnon, Morris, & Watts, 2009). Second, the mediator effectiveness hypothesis (Pyc &
Rawson, 2009) posits that mediators (i.e., a word, phrase, or concept that links a cue with a
target) activated during testing support improved final test performance. By that account, testing
can also be expected to improve performance when the mediators themselves, or other
information linked via mediators, need to be recalled on a transfer test (Coppens, Verkoeijen,
Bouwmeester, & Camp, 2016). Finally, the recently-proposed dual memory theory of test-
enhanced learning (Rickard & Pan, 2017) constitutes a viable framework from within which to
account for results in some cases. According to that theory, test-enhanced learning stems from
the fact that two routes to retrieval are accessible for a tested response (i.e., via “study memory”
from initial study or “test memory” from the initial test). However, when different responses are
required on a transfer test, that theory, in a slightly elaborated form (see Rickard & Pan, 2018),
predicts that only study memory is accessible. Under such circumstances, testing is predicted to
yield no advantage relative to restudy (i.e., no positive transfer). Separately, Thomas, Agarwal, McDermott, and Roediger (2013); Avci (2011; see also Schmidt & Bjork,
1992); and others have speculated that testing may generally yield learning that is more
flexible and broadly applicable than learning via restudy; such descriptive accounts imply that test-enhanced learning will generally yield transfer.
Additionally, McDaniel, Howard, et al. (2009); McDaniel and Little (in press); Nguyen and
McDaniel (2016); Pan, Gopal, et al. (2015); Pan and Rickard (2017); and van Eersel, Verkoeijen,
Povilenaite, and Rikers (2016) have suggested that activities associated with but separate from
the act of testing itself (such as the processing of feedback, more effective subsequent restudy,
and more focused attention) may influence the likelihood of transfer of test-enhanced learning.
The question of whether test-enhanced learning transfers also intersects with a long-running debate in the broader transfer literature between two prominent
and highly influential theoretical perspectives: (a) the identical elements and related similarity-
based models of transfer, and (b) the general principle and other abstractionist models. Those
perspectives have long been debated (for discussions, see Allport, 1937; Barnett & Ceci, 2002; Detterman, 1993; Dudai, 2007; Healy,
2007; Kelly, 1967; Mestre, 2005; Sternberg, 1993). In the former, transfer is commonly
restricted to situations in which the training and transfer contexts are highly similar to one
another (Thorndike, 1906; see also Ebbinghaus, 1885). That similarity may be at the level of
cues, responses, available knowledge, mental states, and/or abstract mental representations (for
discussions see Morris, Bransford, & Franks, 1977; Rickard & Bourne, 1996; Rickard, Healy &
Bourne, 1994; Thorndike, 1906; Tulving, 1970, 1984; Singley & Anderson, 1989). In contrast,
the general principle and other abstractionist models suggest that the learning of underlying
principles (e.g., properties of actions, operations, perceptions, etc.) can facilitate transfer to
contexts that are substantially dissimilar from those that were encountered during training (Judd,
1908; see also Gick & Holyoak, 1980; Simon & Hayes, 1977). According to this perspective,
transfer can be increased by making learners aware of relevant information needed for successful
transfer (e.g., by training with multiple or varied examples, or by informing learners to apply a previously learned principle).
These two perspectives have also been combined into integrative frameworks. Perkins and Salomon (1994; see also Salomon & Perkins, 1989)
proposed that transfer can occur in “low” circumstances when the stimuli are the same or similar
to those that were previously learned as well as in “high” circumstances where learning (i.e., a
search for general principles) occurs at a more abstract level. Barnett and Ceci (2002) proposed
that all transfer, whether through identical elements or general principles, requires successful (a)
recognition of the need to transfer prior learning to the new context, (b) recall of the relevant
knowledge, and (c) execution of prior learning in the new context. Both integrative frameworks
allow for the possibility that successful transfer can be very difficult to obtain in various
circumstances.
Method
Literature Search
To identify studies investigating transfer of test-enhanced learning, we first conducted a preliminary analysis of recent empirical and review
articles, and then undertook an extensive formal literature search. Included were online database
searches for peer-reviewed research articles, dissertations, and theses; ancestral searches of
empirical and review article reference lists; and listserv queries and correspondence with authors
to obtain additional data and unpublished manuscripts. No date restriction was applied during the searches.
Preliminary searches. Due to the lack of standard terminology for transfer studies in
this literature (initial database searches with the keyword transfer in conjunction with test-
enhanced learning and its synonyms yielded only a portion of the studies that are known to
exist), we examined the Carpenter (2012) review article, three reviews of test-enhanced learning
with subsections that addressed transfer (Roediger & Butler, 2011; Roediger & Karpicke, 2006;
Roediger, Putnam, et al., 2011), as well as recent empirical articles to identify types of studies
that involve testing and transfer but do not explicitly use transfer terminology. That preliminary
search revealed that the vast majority of studies involving transfer of test-enhanced learning do
not necessarily discuss transfer per se (cf. Adesope et al., 2017). Rather, many studies use
terms that are specific to the transfer context or contexts under investigation (e.g., test formats).
Accordingly, we compiled a list of the different transfer types that have been investigated in the
literature to date. Keywords addressing these (for a complete listing, see the next section) were then incorporated into the formal database searches.
Database searches. Two online databases were queried for empirical research articles:
PsycINFO and ProQuest Dissertations and Theses. Using these databases, a total of 72 separate
searches were performed using the keywords test-enhanced learning, testing effect, practice
testing, and retrieval practice in combination with the terms transfer, format, related, and
other transfer-specific keywords. These combinations were intended to broadly capture any studies that may involve testing and transfer, plus address
studies from the aforementioned major transfer categories. The searches yielded 383 hits; 103
were duplicates, leaving 280 database records (212 peer-reviewed articles and 66 dissertations,
dating from as early as 1963 and as recent as 2016) for further examination. These records were
entered into a three-stage review process to determine suitability for inclusion in the meta-
analyses. That process, based on that detailed in a prior meta-analytic review on an unrelated
topic (Pan & Rickard, 2015) and summarized in PRISMA (2009; see also APA Publications and
Communications Board Working Group on Journal Article Reporting Standards, 2008) diagram
format in Figure 1, was first completed for the PsycINFO database search results and is detailed
as follows.
The first stage, title-level review, involved both authors of this review separately
screening each title for (a) any mention of test-enhanced learning research, as well as (b) any
mention of transfer. If either condition held or if the title was ambiguous, it was flagged for
potential inclusion. If the title clearly indicated that the article did not address testing or transfer,
or stated that it was a review, commentary, or did not involve the standard test-enhanced learning
paradigm (e.g., studies of retrieval-induced forgetting), it was eliminated from consideration (cf. Rowland, 2014). All articles flagged by at
least one rater were retained for the next stage. Of the 212 peer-reviewed articles entered into
the first stage of review, 110 were excluded and 102 survived. Overall interrater agreement was
The second stage, abstract-level review, involved the same two raters separately reading
each article abstract to verify whether both conditions (a) and (b) from the first stage applied.
Similar to the first stage, if either rater determined that the necessary conditions applied or that
the abstract was too ambiguous for a definitive rating, then the article was flagged for potential
inclusion. Additionally, if the abstract indicated that only clinical populations were involved,
then the article was excluded. Of the 102 articles entered into the second stage of review, 41
The third and final stage, article-level review, involved the first author of this review
examining the full text of each article to determine whether it unambiguously met a set of five
inclusion criteria (which are detailed later in this section) to qualify for meta-analysis, as well as
to verify that it did not violate any of the exclusion rules from the preceding stages. In nine
instances where an article contained ambiguities, the final inclusion decision was made by both
authors discussing and arriving at mutual agreement. Of the 69 articles entered into the third
Owing to the good interrater agreement that was observed for the PsycINFO search
results, the 66 unpublished results from the ProQuest Dissertations and Theses database were
screened by the first author only. Seventeen records survived title- and abstract-level review; of
these, six dissertations passed article-level review and were included in the meta-analyses.
Ancestral searches. In an effort to obtain further studies, the reference lists of all studies
that survived the three-stage screening process, as well as those of six review articles or chapters
addressing test-enhanced learning and/or transfer (including the four aforementioned articles that
were consulted in the preliminary searches, as well as the reference lists of articles in Rawson &
Dunlosky, 2011 and Rowland, 2014) were examined. Sixty-three unique references were
identified in this manner. All of these references survived title- and abstract-level review; 21 passed article-level review and were included in the meta-analyses.
Unpublished studies. To address publication bias and the “file drawer” issue (Strube &
Hartmann, 1983), we solicited unpublished studies and data involving transfer of test-enhanced learning. The list of contacts was drawn from listservs of researchers in
the fields of learning, memory, cognition, and instruction, as well as lists of authors of studies
already included in the meta-analyses. In response to our request (issued on May 15, 2016), we
received 15 responses and obtained the full text of 10 unpublished manuscripts (and were also
referred to articles and dissertations that we had already obtained); of these, six met article-level
inclusion criteria and were included in the meta-analyses. In four cases (Cho et al., 2017;
Eglington & Kang, 2016; additionally, Pan, Hutter, et al., 2018; Rickard & Pan, 2018), we
obtained or had an in-progress or partially redacted manuscript; each of these had sufficient
information to determine study eligibility and to extract effect size and other necessary data.
Inclusion criteria for the article-level review stage. At the final review stage, all
studies from the database and ancestral searches, as well as unpublished works solicited via
author correspondence, were screened against a set of five inclusion criteria. The purpose of
these criteria was to verify that all included studies, experiments, or conditions had specific,
clearly identifiable experimental design features and contained sufficient data for quantitative
meta-analyses. Exclusion of individual studies or experiments was done solely on the basis of
these criteria and was not the result of any assessment of study quality or outcome. The five
criteria were:
1. The most common three-phase test-enhanced learning paradigm must have been used.
This paradigm, which we noted earlier, involves three phases: first, initial study of to-be-
learned materials; second, an intervening training phase on those materials which features
a testing vs. a non-testing reexposure control manipulation; and third, a final test. This criterion excluded studies that used
dropout schedules, or had the presentation of new and different to-be-learned information
during the training phase (e.g., studies of test-potentiated new learning), as well as studies
of adjunct prequestions (for reviews of that literature, see Anderson & Biddle, 1975).
2. Transfer must have been assessed relative to a non-testing reexposure control. Multiple
types of non-testing reexposure controls have been used in the test-enhanced learning
literature, including restudy, rereading, and notetaking (among those, restudy is the most common). The requirement that a non-
testing reexposure control be used reflected prior assertions (e.g., Carrier & Pashler,
1992; Carpenter & DeLosh, 2006; Kuo & Hirshman, 1996; Rowland, 2014) that studies
in which testing is compared against a no-training condition (i.e., materials in the control
condition were not presented in any form during the training phase) preclude any
objective assessment of testing’s benefits relative to any other learning strategy (for
similar observations on the importance of the control condition in the broader transfer
literature). This criterion thus ensured that we could examine whether testing can yield transfer relative to a non-testing learning activity
rather than no learning activity. Included studies fell into one of two widely-used
experimental design types. In the first type, only the transfer effect is assessed on the
final test. In the second type, both testing and transfer effects are assessed on the final
test.
3. Transfer must have been specifically assessed on the final test and separately reported.
Performance on final test questions that address transfer must have been reported apart
from any final test questions that did not involve transfer. Studies in which data from
transfer and non-transfer questions were not separated were excluded on this basis (for
related discussion see Butler, 2010). Additionally, the exact transfer category under investigation must have been clearly identifiable.
4. Proportion correct must have been the dependent measure on the final test. In the vast
majority of studies, final test performance was reported as proportion correct ranging from 0 to 1.0. Studies that reported data in that manner, as
well as studies in which that data could be derived (e.g., number of points earned out of a
maximum possible total) were included. For five studies in which a recognition final test
was used, proportion correct was used where it was reported as the dependent measure
(e.g., Bies-Hernandez, 2013; Huff, Balota, & Hutchison, 2016; Verkoeijen et al., 2012)
or was derivable from reported mean rates of hits (e.g., Carpenter, 2011) or hits minus
false alarms (e.g., Jacoby et al., 2010). (It should be noted, however, that proportion
correct in the case of recognition does not account for response criterion effects and is an
incomplete measure of performance; for discussion see Stanislaw & Todorov, 1999).
5. All necessary information for effect size calculations must have been reported or
derivable. Effect size, sampling variability, sample size, type of experimental design
(between- or within-subjects), and the relevant test statistics and degrees of freedom for
pairwise comparisons (e.g., transfer performance in the testing vs. non-testing reexposure
control conditions) must have been provided in the text, be derivable from figures in the
article (using the pixel-based graphical measurement technique described in Pan & Rickard, 2015), or obtainable via correspondence with the authors (the sketch following this list illustrates typical conversions).
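The specific conversion formulas applied under criteria 4 and 5 are not reproduced in this excerpt. The following is a minimal Python sketch of the kinds of standard conversions those criteria imply; the function names, the between-subjects case shown, and the large-sample variance approximation are illustrative assumptions rather than the authors' actual procedures (within-subjects designs require additional information, such as the correlation between conditions, and are not shown).

```python
# Illustrative sketch only (not the authors' analysis code): standard conversions
# of the kind implied by inclusion criteria 4 and 5.
import math
from typing import Optional, Tuple


def proportion_correct_from_points(points_earned: float, points_possible: float) -> float:
    """Criterion 4: rescale a points-based final test score to proportion correct (0-1.0)."""
    return points_earned / points_possible


def proportion_correct_recognition(hit_rate: float,
                                   false_alarm_rate: Optional[float] = None) -> float:
    """Criterion 4: proportion correct for recognition tests, taken from the reported hit
    rate or, where only hits and false alarms are given, as hits minus false alarms
    (an incomplete measure of performance, as noted in the text)."""
    return hit_rate if false_alarm_rate is None else hit_rate - false_alarm_rate


def cohens_d_between(t: float, n1: int, n2: int) -> Tuple[float, float]:
    """Criterion 5 (between-subjects case): convert an independent-samples t statistic and
    group sizes into Cohen's d and a common large-sample approximation of its variance."""
    d = t * math.sqrt(1.0 / n1 + 1.0 / n2)
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return d, var_d


# Example: a reported t(58) = 2.45 with 30 subjects per group.
d, var_d = cohens_d_between(2.45, 30, 30)
print(round(d, 2), round(var_d, 3))  # ~0.63 and ~0.07
```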
Criteria for handling non-independent data. Some studies involved data that was non-independent in some fashion (i.e., multiple transfer conditions
compared against the same reference condition, a transfer condition compared against multiple
reference conditions, repeated final tests, or data collapsed across conditions). Our criteria for handling such cases were as follows:
1. Each transfer effect size must have been derived from non-overlapping experimental
means. In some experiments, there were unequal numbers of testing and non-testing
reexposure control conditions. These fell into two broad categories: (a) a greater number
of testing conditions than non-testing reexposure controls (e.g., a free recall test, a cued
recall test, and a restudy training condition), or (b) multiple non-testing reexposure
controls (e.g., separate notetaking and rereading conditions compared against a single test condition). In both
circumstances, the multiple pairwise comparisons that are calculable between testing and
non-testing conditions are non-independent. For cases involving (a), one pairwise
comparison was chosen at random for inclusion in the quantitative meta-analyses (effect
sizes that were not included in those analyses are indicated by asterisks in Table 1). For
cases involving (b), the non-testing reexposure control condition that most closely
matched restudy (i.e., the most common reference condition in this literature) was
included. Where there were multiple reexposure controls involving restudy, the one whose
content most closely matched the re-exposure provided in the transfer condition was included. For example, in Butler (2010; Experiment 2), there
were three training conditions: testing, restudy of isolated sentences, and restudy of
passages. Given that the testing condition involved viewing feedback in the form of
isolated sentences (and not whole passages), the included reexposure control condition was restudy of isolated sentences.
2. Data from studies with multiple identical final tests must not have been confounded by
the effects of a prior identical final test. In some studies, subjects completed the same
exact final test multiple times, such as immediately after training and then again after a
delay (e.g., the previously tested items condition in McDaniel, Howard, et al., 2009) or
across multiple test blocks. In such situations, only data from the first test for an item
was included, as the results of that test constitute the purest measures of the retention and
transferability of learning from the training phase. For studies which had independent
between-subjects assignment to an immediate and delayed final test, data from both tests
were included. For studies which used within-subjects assignment to immediate and
delayed final tests, but in which independent and randomly assigned materials (e.g., two
separate text passages) were used on the two tests, data from both tests were included.
3. Data collapsed across conditions or experiments were included and identified as such if
no other inclusion criteria were violated. In some studies, results were only reported for
data collapsed across experiments or across conditions (e.g., different retention intervals).
Provided that no other inclusion criteria were violated, those results were included as
such in the meta-analyses and noted in Table 1 in the following manner: where multiple
experimental conditions were collapsed together, those conditions are denoted with a dual
cross symbol; where multiple experiments were collapsed together, the experiment
numbers are presented side-by-side in the table; where multiple retention intervals are
collapsed together, the delay interval in hours is the average of those intervals (a brief illustration of this handling follows).
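As a rough illustration of rules 1 and 3 above, the Python sketch below shows one way the random selection of a single pairwise comparison and the averaging of collapsed retention intervals could be carried out; the condition labels, effect size values, and seed are hypothetical and are not taken from any reviewed study.

```python
# Hypothetical illustration of rules 1 and 3 above; all values are invented.
import random

# Rule 1, case (a): two testing conditions share a single restudy control, so the two
# pairwise comparisons are non-independent; only one is retained, chosen at random.
candidate_comparisons = [
    ("free recall vs. restudy", 0.42),   # (label, observed effect size d)
    ("cued recall vs. restudy", 0.51),
]
random.seed(1)
retained = random.choice(candidate_comparisons)
print("retained comparison:", retained[0])

# Rule 3: when results were reported collapsed across retention intervals, the delay
# recorded for that effect size is the average of the collapsed intervals (in hours).
collapsed_intervals_hours = [0.1, 48.0]
recorded_delay_hours = sum(collapsed_intervals_hours) / len(collapsed_intervals_hours)
print("recorded delay (hours):", recorded_delay_hours)  # 24.05
```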
Further criteria for studies of transfer across test formats. Two additional rules
applied to studies of transfer across test formats. First, test format must have been the only change
between the initial and final test (and not a change in assessed content). Studies excluded on this
basis remained eligible for inclusion in other categories (throughout the dataset, each effect size
was included in only one category). Additionally, studies in which subjects completed a final
test in the same format as during training (i.e., a test condition), plus completed another final test
in a different format from that used during training (i.e., a transfer condition), were not eligible for inclusion
if the test condition preceded the transfer condition. This rule was implemented to avoid
including any data in which the effect of a change in final test format was contaminated by a preceding final test in the trained format.
Outliers. We did not specifically identify, nor exclude, outlier effect sizes. All data that met the inclusion criteria were retained.
Author correspondence. In several cases, we contacted authors to request clarification or additional data; all but one responded, and nine authors were able to provide the requested
information within the requested three-month period. In one other case (Coppens et al., 2016),
we were able to derive the necessary information from a dataset that had been made publicly available online.
In total, 192 transfer effect sizes from 122 experiments met the criteria for inclusion in the overall and category-level
meta-analyses. Of these, 53 studies had been published and 14 were unpublished by the
conclusion of the literature search period. The publication, completion, or submission dates of
these studies ranged from 1975 to 2016, with the vast majority (60 studies) finished in 2006 or
later. All studies but one (Zhou, Ma, Li, & Cui, 2013) were written in English. Nearly all were
performed using samples recruited from young adult (i.e., university student) populations;
exceptions included 3 studies involving elementary school children, 2 studies with high school
students, and 2 studies with older (50-66 years of age) adults. Descriptive and statistical
information for each study, including stimulus type, delay interval, test format, condition
identifiers, sample size, effect size, and sampling variability, are included in Table 1. Forest
plots depicting each effect size across the reviewed literature are included in Figure 2.
The current body of research on transfer of test-enhanced learning can be organized into
six major categories. We defined these categories based on piecewise discussions in the current
literature and our judgment. Although they should be treated as preliminary, we believe that
they reflect the structure of the literature – as well as some of the major distinctions in
underlying cognitive processes – to a first approximation. The six categories are presented in the
same order throughout this review; this order follows a general pattern of increasing divergence
between the initial and final tests (ranging from relatively “near” to, in some circumstances, “far”
on the near vs. far transfer dichotomy). One exception is transfer to mediator and related word
cues; that category is included last due to its having the fewest articles, which precluded all but
the simplest meta-analyses. As is evident below, the categories investigated to date represent a
considerable range of contextual changes and involve different types of information being transferred.
Test format. In this category, the final test format is different from the initial test
format, but no other major types of transfer are involved. An example is Kang, McDermott, and
Roediger (2007), which included conditions in which subjects trained on previously read text
passages via multiple-choice tests (e.g., “Source confusion is…?” with four answer choice
options) and then took final cued recall tests on the same information (e.g., “Source confusion
is…?” without any provided answer choices); the correct answer (e.g., “misattributing content of
a memory to the wrong source”) was the same on both tests. Studies in this category may
potentially use any of the following four initial or final test formats: free recall (i.e., recall as
much of a text as one can remember), cued recall (i.e., fill-in-the-blank, fragment completion, or
short answer questions), multiple-choice (with between four to six answer options), and
recognition (i.e., two-alternative forced choice or scale judgment old/new questions). In the
literature, six combinations of transfer across test formats have been investigated: free recall to
cued recall (e.g., Karpicke & Blunt, 2011), free recall to recognition (e.g., Verkoeijen et al.,
2012), cued recall to free recall (e.g., Halamish & Bjork, 2011), cued recall to recognition (e.g.,
Carpenter, 2011), cued recall to multiple-choice (e.g., Nungester & Duchastel, 1982), and multiple-
choice to cued recall (e.g., Pan, Gopal, et al., 2015). Many of these format combinations are
further discussed in Duchastel (1981); Foos and Fisher (1991); Hanawalt and Tarr (1961); Hogan
and Kintsch (1971); Mandler and Rabinowitz (1981); McDermott, Agarwal, D’Antonio,
Roediger, and McDaniel (2014); Rickard and Pan (2017); Runquist (1983); and Smith and Karpicke (2014).
Stimulus-response rearrangement. In this category, the stimulus and response on the initial test are also present on the final test, but with the cue and
response roles of those elements reassigned. An example is Carpenter, Pashler, and Vul (2006),
in which subjects first studied a set of paired associates (e.g., beach, blanket). They next
practiced recall of one word from each paired associate (e.g., beach, ?), and on the final test were
tested on the reverse case (e.g., ?, blanket). Another example is Pan, Gopal, et al. (2015), in
which subjects took initial tests on one term of a multi-term fact (e.g., “Overlord, an operation led
by Eisenhower, began with the invasion of WHERE?”), and on the final test had to recall a
different term (e.g., “Overlord, an operation led by WHOM, began with the invasion of
Normandy?”).
Studies in this category fall into one of four subtypes: paired associates (as in the
aforementioned Carpenter et al. example), triple associates (e.g., training on a word triplet such
as “gift, rose, wine” via “gift, rose, ?”, and later being assessed on “?, rose, wine” as in Pan,
Wong, et al., 2016), multi-term facts (as in the Pan, Gopal, et al., 2015 example), and term-
definition facts (e.g., training on “Vision is the ability to see” via “WHAT is the ability to see?”
and later being assessed using the question, “Vision is WHAT?”, as discussed in Pan and
Rickard, 2017). Studies in this category are further discussed in Thomas et al. (2013); Pan, Wong, et al. (2016); and Rohrer et al. (2010).
Untested materials seen during initial study. In this category, the final test assesses
information that was initially studied but neither tested nor otherwise re-exposed during training.
An example is Nungester and Duchastel (1982), in which subjects read a historical text passage
(a 1,700-word text titled “The Victorian Era”), followed by an initial test on some aspects of that
passage (e.g., “What nationality was Prince Albert?”). The final test then assessed other parts of
the passage that were not trained (e.g., “Where was the Crimean War?”). Although similar to
stimulus-response rearrangement in that the final test assesses a previously seen but untrained
response, this category is unique in that the final test questions were not presented or tested in any form during the training phase.
A notable characteristic about the literature on transfer to untested materials is that its
constituent studies vary greatly in the degree of semantic relatedness between tested and untested
materials. In some cases, closely related principles, facts, or details are assessed (e.g., in Chan et
al., 2006, an initial test question was “The largest toucan species is?” and the final test question
was “The most colorful toucan species is?”; both questions referred to (relatively) highly related
information that was located in adjacent portions of an initially studied article). By contrast, in
other studies there is no obvious relation between initial and final test questions (outside of the
fact that both stem from the same general source, as in the aforementioned Nungester and
Duchastel example). However, the degree of “relatedness” between initial and final test
questions defies simple categorization (we considered but dropped the use of sub-categories
ranging from “same or linked concepts” to “generally related” content; for an attempt using
Latent Semantic Analysis, see Chan et al., 2006). The similarity issue is further discussed in
Cranney et al. (2009); Hamaker (1986); Little (2011); and Wooldridge et al. (2014).
Application and inference questions. In this category, the final test requires learners to
relate prior learning to new but conceptually related information (application), such as a new
example, scenario, or goal (Brookhart, 2015; Mayer, 2009), or to integrate prior learning in a
new way but not typically with new information (inference), such as having to uncover (i.e.,
infer) a general principle (McNamara & Kintsch, 1996). In some cases, a mixture of application
and inference questions is used. An example is Johnson and Mayer (2009), in which subjects
took tests after watching a multimedia presentation on lightning formation (e.g., “Please write
down an explanation of how lightning works”), followed by a final test featuring application and
inference questions on that topic (e.g., “What could you do to decrease the intensity of
lightning?”; “Suppose you see clouds in the sky but no lightning; why not?”). Another example
is McDaniel, Howard, et al., (2009; Experiment 2), in which subjects freely recalled as much as
they could remember about a text passage on brakes, followed by a final test featuring
application (e.g., “What could be done to make brakes more effective?”) and inference (e.g.,
“Why do brakes get hot?”) questions. By our analysis, studies in this category have featured
application questions only, inference questions only, a mix of application and inference
questions, or questions combining both types. That analysis relied on the definitions stated here,
which we developed because this category lacks a common definition of application and inference
questions; our classification was informed by final test question types, their descriptions in articles, and comparisons with other relevant literatures.
Types of application questions include evaluation (i.e., assessing new data or an example in the context of prior learning, such as by identifying the most
appropriate concept that matches that example), comparison or contrast (i.e., determining
similarities or differences between new data and prior learning), prediction (i.e., determining
how a system is affected by a new situation), redesign (i.e., modifying a system to achieve a new
goal), and other types. Types of inference questions include bridging inferences (i.e., integrating multiple pieces of information
that were presented separately), conceptual inferences (i.e., uncovering an underlying or overall
principle), inferences that identify a main argument or thesis, and other types. For further discussion
of application and/or inference question types, see Brookhart (2015); Gasparinatou and
Grigoriadou (2013); Marzano, Pickering, and Pollock (2005); Mayer (2001); and McNamara and
Kintsch (1996).
Problem-solving skills. In this category, after training on a problem-solving procedure, a final test involves recall and execution of that procedure to solve a
new but related problem. Studies in this category fall broadly into two main sub-categories: (a)
medical diagnosis and treatment and (b) worked examples. An example of the former is
Kromann, Jensen, and Ringsted (2009), in which medical students studied and took initial tests
on cardiac resuscitation procedures, and then used those procedures to address similar patient
scenarios with modified demographics and/or symptoms on a transfer test (e.g., “You’re about to
establish I.V. access when your patient, a 75 yr old man, becomes unresponsive. You are now
required to manage this patient.”). An example of the latter is van Gog, Kester, and Paas (2011),
in which students studied worked examples (i.e., a problem in which the solution steps are
shown, providing a step-by-step guide on how to arrive at the correct solution) of circuit
troubleshooting problems, took practice tests on those problems (e.g., “Determine how this
circuit should function using Ohm’s law”), and then attempted to solve new problems with
different values and often greater complexity (e.g., again determining how a new circuit should
function, but in this instance there are two circuit faults rather than one as seen previously).
While similar to application questions in that new information is commonly presented on the
transfer test (and, in some cases, malfunctions in a system need to be identified), the problem-
solving category is unique in that training is focused on learning a sequence or set of to-be-
executed procedures. The types of problems used in this category are further discussed in
Karpicke and Aue (2015); Larsen, Butler, and Roediger (2008); and Leahy, Hanham, and Sweller (2015).
Mediator and related word cues. In this category, after training on paired associate
words or word lists via a cued recall test, a final cued recall test involves recall of the same
words but in response to different (i.e., mediator or related word) cues. An example is Carpenter
(2011; Experiment 2), in which subjects took cued recall tests without feedback on paired
associates (e.g., mother, ? for which the answer is child), followed by a final test in which they
again attempted to retrieve target words, but in response to mediator word (e.g., father, ?) or
related word (e.g., birth, ?) cues. Mediator cues are words that have strong preexisting semantic
associations with cues (e.g., father is a mediator for the cue mother), whereas related cues are
words that are weakly related to targets (e.g., birth is related to the target child). For further
discussion of this transfer category, which is the newest and least populated of those analyzed in
this review, see Cho, Neely, Brennan, Vitrano, and Crocco (2016) and Coppens et al. (2016).
Other transfer contexts. Besides the six major categories, the literature also contains
studies exploring other types of transfer, but in all such cases there are currently too few papers
to include in the present meta-analyses (however, in several of these studies, one or more of the
aforementioned transfer categories was also explored; those results are included in the meta-
analyses). There are two categories with two or more articles: transfer of category learning
(wherein subjects learn to classify visually presented category exemplars; e.g., Baghdady,
Carnahan, Lam, & Woods, 2014; Jacoby, Wahlheim, & Coane, 2010), and transfer of visuospatial
learning (wherein subjects recall locations and/or make route or directional judgments; e.g.,
Carpenter & Kelly, 2012; Rohrer et al., 2010, Experiment 2). Other studies are the first and, as
of this writing, only investigations of yet other types of transfer (e.g., Kang, McDaniel, & Pashler,
2011, which involved training on mathematical functions; see also George & Wiley, 2016,
involving analogical transfer). Finally, there are additional transfer contexts (e.g., changes in
social contexts, as in individuals vs. groups) that have yet to be investigated in this literature to
date.
Various candidate moderators of testing and/or transfer effects have been catalogued in
the test-enhanced learning and broader transfer literatures. In the test-enhanced learning
literature, these fall into three categories: (a) encoding factors (e.g., the number of training trials
per item, the presence or absence of feedback, the type of feedback provided, the initial test
format, and proportion correct on the initial test), (b) retrieval factors (e.g., the types of final test
questions), and (c) other design variables (e.g., the length of the retention interval between the
initial and final tests, as well as the type of subject materials that are being learned). Any of
these factors may influence test-enhanced learning, and by extension may also influence testing’s
ability to yield positive transfer. With regard to (a), the use of increased training trials, correct
answer feedback (and especially more extensive feedback, such as feedback containing
explanations), more difficult initial test formats (e.g., cued recall rather than recognition), and
relatively high initial test performance (e.g., > 0.50 proportion correct) have been associated with
larger testing effects; many of these factors have also been hypothesized to improve transfer (for
data and discussions see Butler, Godbole, & Marsh, 2013; Dunlosky et al., 2013; Goode, Geraci,
& Roediger, 2008; Jensen, McDaniel, Woodard, & Kummer, 2015; Karpicke & Aue, 2015;
McDaniel & Masson, 1985; McDaniel, Thomas, et al., 2013; McDaniel, Wildman, et al., 2012;
and Rowland, 2014). With regard to (b), the use of more difficult final test formats (e.g., cued
recall rather than recognition) can yield larger testing effects (Halamish & Bjork, 2011;
Rowland, 2014); corresponding effects on transfer (as well as those of other retrieval factors)
have yet to be thoroughly investigated. Regarding (c), the testing effect tends to be larger at
retention intervals of one day or more relative to shorter intervals (Rowland, 2014), a pattern that
may also hold for transfer, whereas the role of subject materials on testing or transfer effects has
been the subject of differing hypotheses (e.g., Karpicke & Aue, 2015; Pan, Gopal, et al., 2015).
In the broader transfer literature, Gick and Holyoak (1987; see also Barnett & Ceci, 2002; Brooks & Dansereau, 1987; Haskell, 2001; McGeoch,
1942; Perkins & Salomon, 1994; Singley & Anderson, 1989) concluded that four types of factors
moderate transfer: (a) the structure of the training and transfer tasks (e.g., the type of knowledge
that needs to be learned and how similar the tasks are to one another), (b) encoding factors (e.g.,
the number and variability of examples provided during training, amount of training, types of
instructions given during training, and degree of abstract learning during training), (c) retrieval
factors (e.g., whether learners are informed of the transfer context, the similarity of the transfer
cues to those seen during training, and the similarity of responses on the transfer task to those
made during training), and (d) prior knowledge and other pre-experimental factors. With regard
to (a), the more structurally similar the training and transfer tasks are, the more likely transfer is
expected (Haskell, 2001), although transfer may also generally vary by knowledge type (Healy,
2007). With regard to (b), an increased number and variety of examples, more training,
instructions to learn underlying principles, and more abstract learning have all been associated
with improved transfer (Gick & Holyoak, 1980, 1987; Haskell, 2001; Perkins & Salomon, 1994).
Regarding (c), increased similarity in cues and/or responses between the training and transfer
tasks (Wylie, 1919; Osgood, 1949), as well as the provision of hints (Gick & Holyoak, 1980,
1987), have been associated with improved transfer. Regarding (d), if relevant to the transfer
context, prior expertise may also increase transfer (Gick & Holyoak, 1987; Haskell, 2001).
Drawing on findings from both of the aforementioned literatures and on our observation of potentially important design factors during the literature review
process, each effect size included in this review was coded with respect to seven potentially
applicable and analyzable candidate moderators. These candidate moderators were: (a) between-
vs. within-subjects design, (b) number of training phase item repetitions, (c) initial test
performance (i.e., proportion correct during the training phase), (d) retention interval, (e) correct
answer feedback, (f) elaborated retrieval practice, and (g) response congruency. In some cases,
other previously hypothesized moderators could not be analyzed due to their being too rarely or
not at all investigated in this literature. Each candidate moderator was coded by the authors; for
purposes of intercoder agreement and verifying accuracy, a subset of papers was also separately
coded a second time by a trained research assistant. For candidate moderators (a) to (e), all
published articles were coded a second time; for candidate moderators (f) and (g), a randomly
selected 25% of the overall dataset (corresponding to 48 effect sizes; cf. Bujang & Baharum,
2017) was coded again. Any discrepancies between raters were resolved by discussion and mutual agreement.
Each of the seven candidate moderators was investigated in the overall meta-analyses.
Where there were sufficient data to do so, they were also investigated in the category-level meta-analyses.
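To make the coding scheme concrete, the sketch below represents a single effect size coded on the seven candidate moderators (a) through (g), together with a simple proportion-of-matching-codes agreement check of the kind described above. The field names, example values, and data structure are hypothetical illustrations, not the authors' coding materials.

```python
# Hypothetical coding record for one effect size; field names and values are illustrative only.
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class EffectSizeCoding:
    between_subjects_design: bool              # (a) between- (True) vs. within-subjects (False)
    training_repetitions: int                  # (b) training-phase item repetitions (1 to 5)
    initial_test_proportion: Optional[float]   # (c) initial test proportion correct; None if unreported
    retention_interval_hours: float            # (d) delay from end of training to final test
    correct_answer_feedback: int               # (e) 1 = correct answer feedback shown, 0 = none
    elaborated_retrieval_practice: int         # (f) 1 = present, 0 = absent
    response_congruency: int                   # (g) 1 = congruent responses, 0 = incongruent


coder_1 = EffectSizeCoding(True, 3, 0.72, 48.0, 1, 0, 1)
coder_2 = EffectSizeCoding(True, 3, 0.72, 48.0, 1, 0, 1)  # a second, independent coding


def proportion_agreement(a: dict, b: dict) -> float:
    """Intercoder agreement computed as the proportion of identically coded fields."""
    return sum(a[k] == b[k] for k in a) / len(a)


print(proportion_agreement(asdict(coder_1), asdict(coder_2)))  # 1.0 when all codes match
```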
Between- vs. within-subjects design. With respect to training condition (e.g., testing vs.
non-testing reexposure control), each study was coded as using a between- or within-subjects
design. In the test-enhanced learning literature, between-subjects designs typically yield larger
effect sizes (Rowland, 2014), although that result has not always been obtained when between-
vs. within-subjects group assignment has been manipulated within a single experiment (e.g.,
Huff, Balota, & Hutchison, 2014; Rowland, Littrell-Baez, Sensenig, & DeLosh, 2014).
Number of training phase item repetitions. Each study was coded for the number of
repetitions of each item during training, ranging from one to five, as a continuous variable. In all
meta-analyzed studies, the number of item repetitions in the testing and non-testing reexposure
control conditions were identical. Some researchers have emphasized that repeated testing is an
important factor in maximizing the benefits of test-enhanced learning and yielding transfer (e.g.,
McDaniel, Thomas, et al., 2013; McDaniel, Wildman, et al., 2012). Intercoder agreement
Initial test performance. Where reported, initial test (i.e., training test) proportion
correct data was recorded as a continuous variable. Initial test performance, particularly in the
absence of correct answer feedback, has been suggested as a moderator of test-enhanced learning
(e.g., Kang et al., 2007; Rowland, 2014; Smith & Karpicke, 2014). Where multiple initial test
repetitions were administered, proportion correct data from the last of those tests was included in
the analyses. Initial test performance was the only candidate moderator for which there was
missing data (i.e., for 30% of included effect sizes, that data was not collected or reported).
Retention interval. The length of time between the end of training and the final test, in
hours, was recorded for each study as a continuous variable. In the test-enhanced learning
literature, the magnitude of the testing effect tends to become larger as the duration of the
retention interval increases (e.g., Carpenter, Pashler, et al., 2008; Roediger & Karpicke, 2006),
with retention intervals longer than one day often yielding larger testing effects than retention
intervals that are shorter than one day (Rowland, 2014). Intercoder agreement (proportion of
Correct answer feedback. Each study was coded for the presence or absence of correct
answer feedback during training. All experiments in which subjects were able to view the
correct answers to initial test questions shortly after answering them (i.e., in the same training
session) were coded (as the value 1) as providing correct answer feedback (Rowland, 2014).
Cases with no feedback were coded with the value of zero. Feedback that did not include
exposure to all the correct answers was coded as no feedback, of which there were two types: (a)
feedback involving the number of questions scored correctly out of the total number of questions
(as in Meyer & Logan, 2013), and (b) feedback provided during an instructor-led brief discussion
session that was general in nature and did not specifically address individual subjects’ responses
(as described in Kromann et al., 2009; Kromann, Jensen, & Ringsted, 2010; Kromann,
Bohnstedt, Jensen, & Ringsted, 2010). One study in which data from feedback and no feedback
conditions were collapsed together (Butler & Roediger, 2007) was excluded from analyses
involving correct answer feedback. The emphasis on the correct answers being presented during
feedback is due to the fact that feedback lacking such information is often no better than no
feedback at all (e.g., Anderson, Kulhavy & Anders, 1971; Kulhavy & Anderson, 1972; Pashler,
Cepeda, Wixted, & Rohrer, 2005; for an exception, see Butler, Karpicke, & Roediger, 2008).
Elaborated retrieval practice. Each study was coded for the presence or absence of two
types of training techniques – broad encoding methods and elaborative feedback – that have
previously been hypothesized in the literature to enhance transfer. Studies in which either type of training
manipulation was utilized, or both (e.g., Little, 2011, Experiment 5, and McDaniel et al., 2015),
were classified as using elaborated retrieval practice. We originally planned on analyzing
broad encoding methods and elaborative feedback separately, but report analyses of the two in
combination – under the single label of elaborated retrieval practice (not to be
confused with the elaborative retrieval hypothesis) – due to insufficient data at most category levels
when those components were fitted separately. If broad encoding, elaborative feedback, or both
(as defined below) was present, this variable was coded as 1; otherwise it was coded as zero.
Broad encoding methods. Each study was coded for the presence or absence of initial
cued recall or multiple-choice tests that directed subjects to specifically think about additional
information (e.g., content related to the tested concept or target item) or to retrieve multiple
pieces of information that pertain to a given concept or target item while making a response or
responses. This contrasts with far more common cued recall and multiple-choice initial tests
which involve retrieval of (or recognition of) a single response for a given concept or target item
and/or do not directly specify the consideration of additional information while making
responses. The majority of included studies did not feature broad encoding methods, but several
(specific conditions identified in Table 1) involved one or more of the following four techniques:
1. Broad retrieval instructions – recalling any and all related information that was presented
during the initial study phase before responding (e.g., “think of everything you can recall
that is possibly related to the answer”; as occurred in Chan, et al., 2006; Experiment 3).
2. Evaluation of all answer options – considering each of the presented answer options prior to selecting an answer (as occurred in Little, 2011, Experiment 5).
3. Explanatory recall – recalling and explaining relevant information, in one's own words, while responding on an initial test (as occurred in Hinze et al., 2013).
4. Use of high and low order questions – answering multiple questions for a given concept
or fact, with those questions not just involving pure recall, but also involving higher order
cognitive processes (to use the terminology of Bloom’s taxonomy, namely questions
requiring the learner to evaluate, analyze, or synthesize; cf. Anderson & Krathwohl,
2001; Bloom, 1956, 1984). An example is McDaniel, Bugg, et al. (2015), in which each
concept was trained with application questions and term retrieval questions.
Further discussion of how broad encoding methods may promote transfer can be found in Chan et al. (2006); Hinze et al. (2013); Jensen et al. (2014); Little
(2011); Nguyen and McDaniel (2016); and Pan and Rickard (2017).
We acknowledge that the definition of broad encoding methods presented here may not
receive total agreement from readers. In particular, free recall tests might be argued as a method
that also induces broad encoding of to-be-learned materials. However, due to the ambiguity that
results from having completely open-ended answers, the production of which may or may not
involve processing of information that is later relevant to a transfer test, studies using free recall
were not coded as involving broad encoding methods. Accordingly, it should be emphasized that
any conclusions that can be drawn from the current meta-analyses regarding broad encoding methods apply only to such methods as defined here.
Elaborative feedback. Each study was coded for the presence or absence of post-retrieval
activities that, beyond the processing of brief correct answer feedback, enabled subjects to
extensively restudy target materials or information that would later be relevant to correct responding on the transfer test. Studies featuring elaborative feedback (specific
conditions are identified in Table 1) used one (or more) of the following three methods:
1. Post-retrieval restudy – an opportunity to restudy the target materials after testing, which
includes re-reading of entire text passages (e.g., McDaniel, Howard, et al., 2009;
Wooldridge et al., 2014; Zhou et al., 2013) or extensive review in preparation for an
upcoming test (e.g., the review period in Balch, 1998). These methods have been, by far, the most common
form of elaborative feedback in this literature.
2. Explanatory feedback. Feedback that, beyond simply providing a brief (e.g., one word or
short phrase) correct answer, contains an explanation of the answer, an explanation of the
underlying concept, and/or reasons why it is correct; that feedback (usually comprised of
several sentences) would later be relevant on a transfer test (e.g., the conceptual questions used in Butler et al., 2013).
3. Extended and detailed feedback – feedback that could be repeatedly viewed after testing
for an unlimited period of time, including test questions, responses, and correct answers
(e.g., McDaniel, Anderson, et al., 2007; students could review that feedback for a week),
as well as feedback that includes all target materials (e.g., complete sets of premises used
during training). Further discussion of how elaborative feedback may promote
transfer can be found in Butler et al. (2013); McDaniel, Howard, et al. (2009); McDaniel and
Little (in press); Pan, Gopal, et al. (2015); Pan and Rickard (2017); and van Eersel et al. (2016).
Coding of studies for elaborative feedback was performed through inspection of article
methods sections and, in some cases, experimental materials. We acknowledge that the
definition of elaborative feedback presented here may not fully align with some readers’ views
(for related discussions of feedback types, see Butler et al., 2013; Kulhavy & Stock, 1989).
Ultimately, as with broad encoding methods, the classification of studies for elaborative
feedback reflected the available evidence and (in cases of ambiguity) our judgment (with both
authors arriving at mutual agreement). Any conclusions about elaborative feedback apply only to elaborative feedback as defined here.
It should also be noted that elaborative feedback was provided in 17 of 23 included effect
sizes involving initial free recall tests (in the form of post-retrieval restudy opportunities, a
common implementation of which involved a free recall attempt, a restudy period, and then a
second free recall attempt). Thus, the coding of studies for elaborative feedback (and hence
elaborated retrieval practice) encompassed the majority of studies in the literature featuring free
recall on the initial test, plus addressed the common training method of free recall testing followed by restudy.
Response congruency. Each study was dichotomously coded for the presence or absence
of correct answer response congruency on the initial and final test. Response congruency was
defined as having the same or very substantially overlapping answers. We classified studies
based on descriptions or examples of the materials and methods used for initial and final tests (as
available in the source articles). When performing this classification, we used a stringent
criterion wherein ambiguous cases were treated as not congruent; only unambiguous and obvious
cases where the same or substantially overlapping answers were present on initial and final tests
were classified as having response congruency. Because of this stringent criterion, any
significant effects for response congruency that may be observed in the current meta-analyses
are, if anything, likely to be conservative. Studies classified as having response congruency
included (a) the majority of studies involving transfer across test formats (i.e., all studies in the
cued recall to free recall, cued recall to multiple-choice, and multiple-choice to cued recall sub-
categories), (b) all studies involving transfer to mediator and related word cues, and (c) a sub-
category of transfer of problem-solving skills, namely medical diagnosis and treatment (wherein
initial and final tests involve scenarios and procedures that are “essentially the same”; per
Kromann et al., 2009, p. 23). Together, these studies accounted for 40% of all effect sizes in the
dataset. For studies in the stimulus-response rearrangement category, the
correct responses on the final test were, by definition, different from the correct responses on the
initial test. Similarly, there was minimal-to-no response congruency for studies of transfer to
untested materials seen during initial study, in nearly all studies involving transfer to application
and inference questions, and in all studies in the worked examples sub-category of transfer of
problem-solving skills. Exceptions in the application and inference category were McDaniel,
Bugg et al. (2016) and Nguyen and McDaniel (2016; Experiment 1), in which the correct
responses to final transfer test questions were identical to correct responses during training.

3 Having strong response congruency might seem to be antithetical to the definition of transfer (i.e., the same
response does not constitute a new context). However, as an example, one could be asked a definitional question
(e.g., “The degree to which a measurement or a test is consistent is called…?”, for which the answer is reliability),
and later be asked to provide the same response for an application question (e.g., Jon weighed a stone on the same
scale three times and obtained different readings each time; the scale lacks…?), which is a case of the same
response used in a different context (i.e., a clear case of transfer). There are a variety of such cases.
For transfer across test formats, the sub-categories involving free recall on the initial test
were regarded as having ambiguous response congruency (due to the fact that information that is
retrieved on a free recall test may or may not match the answers on a subsequent test that
involves more precisely specified cues and responses), and hence were coded as having no
congruency. Similarly, all studies in the cued recall to recognition sub-category were rated as
having no response congruency, with the exception of Carpenter (2011, Experiment 1). In that
sub-category, the recognition test typically involves making old/new judgments to previously
seen (i.e., old items) and new (i.e., lure items) stimuli; the previously seen stimuli have strong
response congruency, whereas new stimuli do not (final test data was separately reported for old
items only in the case of the Carpenter study; those particular results were coded as having
response congruency).
Three additional candidate moderators, all involving comparison of sub-categories within a single category, were also
analyzed due to prior treatment or speculation in the literature. These were: free recall vs. not
free recall on the initial test and multiple-choice vs. not multiple-choice on the initial test (test format
category); and worked examples vs. medical diagnosis and treatment (problem-solving skills
category).
All formal meta-analyses were conducted using a standardized effect size, Cohen’s d.
Each effect size was computed as the mean difference over subjects in proportion correct
between the two final test conditions of interest (transfer condition minus non-testing reexposure
control condition) divided by the standard deviation of that mean difference. All effect sizes
were calculated from a reported or derivable t statistic or a single degree of freedom F statistic,
and a reported or derivable sample size or degrees of freedom. For between-subjects designs, t
values were converted into d using the equation (Glass, McGaw, & Smith, 1981):
d = t \sqrt{ \frac{n_T + n_R}{n_T \, n_R} } ,    (1)
where nT refers to the sample size in the test condition and nR refers to the sample size in the
non-testing reexposure control condition. For between-subjects designs in which the results of a
statistical test of interest were not reported, but in which group-level means, their associated
standard errors or deviations, and sample sizes were available or derivable from graphically
reported information (see Pan & Rickard, 2015, for a description of the method of extracting data
values from graphically presented information), the t statistic was calculated from those data.
For within-subjects designs, t values were converted to d using the equation (from Dunlap et al., 1996):

d = t \sqrt{ \frac{2(1 - r)}{n} } ,    (2)
where n is the sample size and r is the estimated correlation of final test performance in the
testing and non-testing reexposure control conditions. Following Dunlap et al. (1996) and
Rowland (2014), the unknown value of r was set to .5 with the expectation of making within-
and between-subjects effect sizes roughly comparable, thus reducing Equation 2 to:
d = \frac{t}{\sqrt{n}} .    (3)
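As a concrete illustration of Equations 1 through 3, the short Python functions below perform the two conversions; the example values are arbitrary and are not drawn from any included study.

```python
import math

def d_between(t, n_t, n_r):
    """Equation 1: convert an independent-groups t statistic to Cohen's d."""
    return t * math.sqrt((n_t + n_r) / (n_t * n_r))

def d_within(t, n, r=0.5):
    """Equations 2-3: convert a paired t statistic to d; with r = .5 this reduces to t / sqrt(n)."""
    return t * math.sqrt(2 * (1 - r) / n)

print(d_between(2.5, 30, 32))   # hypothetical between-subjects experiment
print(d_within(2.5, 28))        # hypothetical within-subjects experiment
```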
All effect sizes were computed using Equations 1 or 3 regardless of whether an effect
size was directly reported in the article text. We elected to do so for consistency and accuracy,
given that (a) less than one quarter of included studies reported any effect size information, and
(b) not all of the reported effect sizes matched those that were calculated using the above
methods (our calculations produced effect sizes that matched within d = 0.05 for 30 of the 50
reported effect sizes). Given the long history in psychology of statistical training on t and F
statistics, effect size estimates based on those statistics may be more accurate than those reported
The effect size sampling variability (sv) was calculated from equations specified in
Morris and DeShon (2002). For within-subjects designs, sv was calculated using the equation:
sv = \frac{1}{n} \left( \frac{n - 1}{n - 3} \right) \left( 1 + n d^{2} \right) - \frac{d^{2}}{c^{2}} ,    (4)
where c refers to the bias function in Hedges (1982) that is calculated using the equation:
c = 1 - \frac{3}{4\,df - 1} ,    (5)
with df referring to degrees of freedom. For between-subjects designs, sv was calculated using
the equation:

sv = \frac{1}{\tilde{n}} \left( \frac{N - 2}{N - 4} \right) \left( 1 + \tilde{n} d^{2} \right) - \frac{d^{2}}{c^{2}} ,    (6)

where N = n_T + n_R and \tilde{n} = n_T n_R / (n_T + n_R).
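The following Python sketch of Equations 4 through 6 assumes the conventional degrees of freedom for the bias function (n − 1 within subjects, N − 2 between subjects), which the text does not state explicitly; it is illustrative rather than a reproduction of the authors' computations.

```python
def bias_c(df):
    """Equation 5: Hedges' (1982) small-sample bias function."""
    return 1 - 3 / (4 * df - 1)

def sv_within(d, n):
    """Equation 4: sampling variability of d for a within-subjects design (assumes df = n - 1)."""
    c = bias_c(n - 1)
    return (1 / n) * ((n - 1) / (n - 3)) * (1 + n * d**2) - d**2 / c**2

def sv_between(d, n_t, n_r):
    """Equation 6: sampling variability of d for a between-subjects design (assumes df = N - 2)."""
    big_n = n_t + n_r
    n_tilde = n_t * n_r / big_n
    c = bias_c(big_n - 2)
    return (1 / n_tilde) * ((big_n - 2) / (big_n - 4)) * (1 + n_tilde * d**2) - d**2 / c**2
```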
Ninety-five percent confidence intervals for each effect size in the forest plots of Figure 2
were computed using IBM SPSS Statistics (International Business Machines Corp., Armonk,
NY) software and Smithson’s (2003) publicly available noncentral t scripts. This method uses
the software’s noncentral t calculator and Laubscher's (1960) normal approximation method
(Wuensch, 2012).
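For readers without access to SPSS, an interval of the same general form can be approximated by inverting the noncentral t distribution. The sketch below is a minimal Python illustration for the within-subjects case (d = t/√n); it is not the Smithson script implementation used here, and the bracketing values passed to the root finder are our own assumptions.

```python
import numpy as np
from scipy import stats, optimize

def ci_d_within(t_obs, n, alpha=0.05):
    """Approximate CI for a within-subjects d = t / sqrt(n) via the noncentral t (df = n - 1)."""
    df = n - 1

    # Lower noncentrality: value at which t_obs falls in the upper alpha/2 tail.
    def upper_tail(ncp):
        return stats.nct.sf(t_obs, df, ncp) - alpha / 2

    # Upper noncentrality: value at which t_obs falls in the lower alpha/2 tail.
    def lower_tail(ncp):
        return stats.nct.cdf(t_obs, df, ncp) - alpha / 2

    ncp_lo = optimize.brentq(upper_tail, t_obs - 50, t_obs)   # generous brackets
    ncp_hi = optimize.brentq(lower_tail, t_obs, t_obs + 50)
    return ncp_lo / np.sqrt(n), ncp_hi / np.sqrt(n)

print(ci_d_within(t_obs=3.2, n=24))
```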
Hierarchical random-effects meta-analysis (e.g., Raudenbush, 2009) was one of two main approaches employed for quantitative meta-analyses.
Two random effects, study and experiment (within study), were estimated hierarchically using the model:

T_{ij} = X_{ij}\beta + \theta_j + \eta_{ij} + \varepsilon_{ij} ,    (7)

where Tij is the estimated effect size for group i in study j, Xij is the design matrix in study j, β is
the vector of regression coefficients, θj is the study-level random effect, ηij is the group-level
random effect, and εij is the sampling error.
In random-effects meta-analysis the observed effect size at each level of the hierarchy
(i.e., for each study and each group within study) is treated as a random deviate from its own
population effect size distribution. The degree to which the effect sizes are in fact heterogeneous
(i.e., random deviates from different distributions) vs. homogeneous (i.e., random deviates from
the same distribution) can be quantified, both prior to and after fitting candidate moderator
variables. In the current model, the residual variation of the effect size estimate Tij can be
decomposed as:

\mathrm{Var}(T_{ij} - X_{ij}\beta) = \tau^{2} + \omega^{2} + \upsilon_{ij} ,    (8)

where τ2 is the variance of the between-study residuals, θj, and ω2 is the variance of the within-
study residuals, ηij, and υij is the known sampling variability of each group. Estimates of τ2 and
(or) ω2 that are greater than zero suggest that heterogeneity is present and that fixed-effects
moderator variables (i.e., meta-regression) may help explain differences in effect sizes over studies and experiments.
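A small simulation may help make the model and its variance decomposition concrete. The variance components, coefficients, and study counts below are made-up illustrative values, not estimates from this review.

```python
import numpy as np

rng = np.random.default_rng(0)

tau2, omega2 = 0.08, 0.05              # illustrative between- and within-study variances
beta = np.array([0.20, 0.35])          # intercept and one dichotomous moderator

effect_sizes = []
for study in range(40):                            # 40 hypothetical studies
    theta_j = rng.normal(0.0, np.sqrt(tau2))       # study-level random effect
    for group in range(3):                         # 3 effect sizes per study
        x = np.array([1.0, float(rng.integers(0, 2))])   # design row for this group
        eta_ij = rng.normal(0.0, np.sqrt(omega2))  # group-level random effect
        v_ij = rng.uniform(0.02, 0.10)             # known sampling variance
        t_ij = x @ beta + theta_j + eta_ij + rng.normal(0.0, np.sqrt(v_ij))
        effect_sizes.append(t_ij)

# Residual variation around the fixed effects decomposes as tau2 + omega2 + v_ij (Equation 8).
print(np.var(effect_sizes))
```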
Given that the covariance structure of the effect size estimates is unknown in the transfer
of test-enhanced learning literature, we employed robust variance estimation (Hedges, Tipton, &
Johnson, 2010; Tanner-Smith & Tipton, 2013) in the model fitting. All meta-analyses were
performed using Stata (StataCorp LP, College Station, TX, USA) and the macro robumeta.ado,
which can be downloaded from the Stata Statistical Software Components archive (SSC).
Our dataset includes dependencies that arise from the nesting of experiments within a study (i.e., within a single paper). Accordingly, we used the macro's
hierarchical weight type option, which accommodates such dependencies (see Tanner-Smith &
Tipton, 2013).
Although the hierarchical random-effects approach has relatively good statistical power and can accommodate dependencies due to nesting of
experiments within paper, it does not adjust for potential publication bias. As supplemental
analyses that can both detect and adjust for publication bias, we used the precision-effect test and
precision-effect estimate with standard error (PET-PEESE) method (Stanley & Doucouliagos, 2014) and the
sensitivity analysis with a priori weight functions (Vevea & Woods, 2005) approach.
PET-PEESE analyses. Both the PET and PEESE analysis procedures involve standard,
weighted linear regression in which d is predicted by either the standard error (PET) or the sv
(PEESE), weighted by study precision, 1/sv. Any significant moderator variables that were
identified in the random-effects analysis were also included in the PET-PEESE analyses. The
PET and PEESE regression equations take the form:

d_i = \beta_0 + \beta_1 se_i + \sum_k \alpha_k z_{ik} + \varepsilon_i ,    (9)

d_i = \beta_0 + \beta_1 sv_i + \sum_k \alpha_k z_{ik} + \varepsilon_i ,    (10)

where β0 refers to the intercept estimate, β1 is the slope estimate, sei is the standard error
for the ith effect size, svi is the corresponding sampling variability, zik is the kth moderator for the ith effect size, and αk
is its regression coefficient. Stanley (2017) recommends the initial application of Equation 9. If the intercept is positive or is non-
significantly negative at p >.10, then performing the primary analysis using Equation 10 is
recommended. We used that criterion, along with the additional criterion that PEESE was used
if one or more of the moderators that were identified in the random-effects analysis was
significant at the 0.05 level (all dichotomous moderators were coded as in the random-effects
analysis such that the level of the moderator that was expected to yield smaller effect sizes took a
value of zero and the level expected to yield larger effect sizes took a value of 1). By those
criteria, PEESE rather than PET was in all cases indicated, and thus only the PEESE results are reported.
If there is publication bias, then it is expected that there will be a positive slope (β1)
relating d to sv. That effect is expected because studies with low precision will have the highest
variability in effect size estimates, and because in most cases unpublished studies are those with
low precision and small effect size estimates that do not reach the traditional statistical
significance level. Effect size estimates adjusted for publication bias are assessed at the intercept
(including moderator intercepts), corresponding to the hypothetical best case study with zero
sampling variability.
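Conceptually, the PEESE fit is an ordinary weighted least squares regression. The following minimal Python sketch illustrates that idea; it is not the authors' implementation, and the optional `moderators` argument is a hypothetical convenience of ours.

```python
import numpy as np

def peese(d, sv, moderators=None):
    """Weighted least squares fit of d on sv (Equation 10), with weights 1/sv.

    `moderators` may be a 1-D or 2-D array of moderator codes (optional)."""
    d, sv = np.asarray(d, float), np.asarray(sv, float)
    cols = [np.ones_like(d), sv]
    if moderators is not None:
        cols.append(np.asarray(moderators, float))
    X = np.column_stack(cols)
    w = np.sqrt(1.0 / sv)                         # sqrt of precision, so the fit uses weights 1/sv
    coef, *_ = np.linalg.lstsq(X * w[:, None], d * w, rcond=None)
    return coef                                   # [intercept, slope on sv, moderator effects...]
```

For the PET variant (Equation 9), the standard error would replace sv as the predictor while the weighting is unchanged.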
In our view, the hierarchical random-effects and PEESE analyses are complementary.
Accordingly, confidence is highest when an effect that is detected in the former analysis method is also evident in the latter.
Sensitivity analyses. As a further check, we used the sensitivity analysis with a priori weight functions
approach developed by Vevea and Woods (2005; see also Vevea & Hedges, 1995) to estimate effect sizes under different assumed degrees of publication bias.
Vevea and Woods specify a set of fixed weights that can be used to explore four different
scenarios of possible publication bias: the cases of moderate and severe bias for both one- and
two-tailed tests. We conducted sensitivity analyses using Vevea and Woods’ publicly available
weight functions, under each of those four scenarios and using the authors’ example p-value cutoffs of .001, .01, .05, and
.50 (and for the two-tailed cases, also .999, .99, and .95).
Results
The meta-analyses are reported in the following order. First, we report random-effect
meta-analyses on the entire dataset of 192 effect sizes across 67 papers. This includes (a) the
estimated weighted mean effect size for the transfer of test-enhanced learning literature, (b) fits
of each candidate moderator in isolation, and (c) simultaneous fits of candidate moderators (i.e., meta-regression),
yielding a final model. The latter analysis involved fitting all candidate moderators and then
iteratively eliminating the least significant (i.e., largest p-value) candidate moderator from the
model, one at a time (Van den Bussche, Noortgate, & Reynvoet, 2009), until all remaining
moderators were statistically significant or marginally significant (here and throughout this
article, marginally significant refers to p < .10). Marginally significant moderators were included because it may be of interest to the field to know about factors that may influence transfer.
Second, we report the results of PEESE and sensitivity analyses for the entire dataset. In
these analyses, all moderator variables that were identified in the random-effects meta-analyses
were included. Third, we repeat the same analysis sequence, but limited to the 135 effect sizes
for which proportion correct on the initial test was reported (and hence that candidate moderator
could be tested). Initial test performance was the only candidate moderator with substantial
missing data in our dataset. Fourth, we report category-level meta-analyses where possible, each
of which involved the same general analysis sequence as in the overall meta-analyses (i.e.,
random-effects meta-analyses involving steps a, b, and c, after which the results of PEESE and sensitivity analyses are reported).
Overall Meta-Analyses
Across the entire dataset of included studies, the weighted mean effect size in the
random-effects model was d = 0.40 (p < .00001), a medium effect by traditional standards
(Cohen, 1988), with a 95% confidence interval (C.I.) of [0.31, 0.50]. That result supports, for
the first time at the literature level through quantitative means, the general view that test-
enhanced learning can yield transfer (pending tests for publication bias). There was also
evidence of heterogeneity, however, both at the between paper (τ2 = 0.084) and the experiments
within paper (ω2 = 0.050) levels. It is perhaps not surprising that heterogeneity is present in this
analysis, given the wide range of materials and transfer contexts in the literature.
Further insight into the effect size patterns can be gained by visual inspection of the
forest plots in Figure 2 (all panels). Those plots display effect sizes and confidence intervals for
each experiment organized by the six major transfer categories and their sub-categories. In the
context of the overall pattern of positive transfer, there is substantial variability in effect sizes
between both major categories and sub-categories (reflected quantitatively in the τ2 value) and, in
some categories, over effect sizes within paper (reflected in the ω2 value). The weighted mean
effect sizes and confidence intervals for the entire dataset and for each category are shown in
Table 2.
Single and simultaneous moderator fits to the full dataset using random-effects
meta-analysis. The heterogeneity observed in the overall model fit motivates consideration of candidate moderators.
Single moderator fits. Results of the single moderator fits to the overall dataset are listed
in Table 3. In those fits, only response congruency was significant (p = .0006). When there was
no response congruency between the initial and final tests, the weighted effect size was d = 0.28,
p < .00001, C.I. [0.17, 0.39]; if response congruency held, the estimated effect size increased by
d = 0.30, p = .0006, C.I. [0.14, 0.47], yielding an estimated effect size of d = 0.58.
Simultaneous moderator fits. Results of the simultaneous moderator fits to the overall
dataset are listed in Table 3. In the final model, the moderators of response congruency and
elaborated retrieval practice were significant (ps ≤ .0094). No other candidate moderators
approached significance in that model. Having the same correct responses on the initial and final
tests yielded a higher estimate of positive transfer (estimated increase of d = 0.35, p = .0002, C.I.
[0.18, 0.51]), as did the use of elaborated retrieval practice (estimated increase of d = 0.22, p =
.0094, C.I. [0.059, 0.38]). When both of those factors were present, the estimated transfer effect
size was d = 0.78. When neither was present, the estimated transfer effect size was d = 0.21.
Between-paper heterogeneity was reduced more than within-paper heterogeneity in that model.
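The additive structure of the final model can be checked directly from the estimates just reported; no values beyond those in the text are assumed.

```python
intercept = 0.21      # estimated transfer effect size with neither moderator present
congruency = 0.35     # increment for response congruency
elaboration = 0.22    # increment for elaborated retrieval practice

print(intercept + congruency + elaboration)   # 0.78, the estimate when both moderators are present
```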
Analyses adjusted for publication bias. Results of PEESE analyses to the full dataset are
listed in Table 4 (for completeness, the table includes the cases in which the previously identified
moderators are or are not included, although we focus on the results for the former case). There
was a highly significant effect (p < .0001) of sv, suggesting publication bias. Moreover, unlike
the random-effects analyses, the intercept, representing the estimated effect size when neither
moderator effect is present, is effectively zero. However, the effect size estimates for the
moderator variables are highly consistent with the random-effects analyses discussed earlier.
When response congruency is present, the estimated increase in d is 0.36, and when elaborated
retrieval practice is present, the estimated increase in d is 0.18. Thus, there is no evidence that
the moderator results in the final random-effects model were meaningfully contaminated by
publication bias.
Results of sensitivity analyses using various selection methods are listed in Table 5.
Under all four scenarios of potential publication bias (moderate and severe one- and two-tailed
publication bias, respectively), the effect size estimates for the two moderators were similar to or
larger than those observed in the random-effects and PEESE analyses. The intercept was more
variable under the four scenarios and became negative in the case of severe one-tailed
publication bias. It is also notable that the intercept in the PEESE analysis to the overall dataset
with moderators included (d = 0.015) falls in between the estimated intercepts in the moderate
one-tailed (d = 0.12) and severe one-tailed (d = −0.12) sensitivity analyses. In our estimation,
one-tailed publication bias (i.e., publication bias that obscures cases of non-significant and
negative results) of at least moderate severity is likely present in this literature.
Meta-Analyses of Studies with Initial Test Performance Data
In the random-effects analysis, the weighted mean effect size for studies in which
proportion correct on the initial test was reported (k = 135 effect sizes; 30% of effect sizes
lacking such data were excluded) was d = 0.41, p < .00001, C.I. [0.30, 0.52], which is nearly
identical to that of the full dataset. Between-paper and within-paper heterogeneity was also observed.
Single and simultaneous moderator fits to studies with initial test performance data
using random-effects meta-analysis. Results of single and simultaneous moderator fits to the
135 effect sizes with initial test data are listed in Table 3.
Single moderator fits. In the single moderator fits (Table 3), the number of training
phase repetitions, response congruency, and initial test performance moderators were significant
(ps ≤ .0048). The results for response congruency matched those in the overall meta-analyses.
The results for the number of training phase repetitions moderator suggests that with each added
repetition during training, the proportion correct transfer effect size increases by an estimated d =
0.13, p = .0011, C.I. [0.060, 0.20]. However, as described below, that moderator did not survive in the simultaneous fits.
Simultaneous moderator fits. In the final model (Table 3), there were three robust
moderators: response congruency and elaborated retrieval practice (as in the overall fits to the
full dataset), plus initial test performance (ps ≤ .015). Between-paper heterogeneity was
substantially reduced (from τ2 = .077 in the model fit with no moderators to τ2 = 0.028), although
within-paper heterogeneity was not (ω2 = 0.061). Having achieved a higher proportion correct
on the initial test was associated with a greater likelihood of transfer – as was the presence of
response congruency and the use of elaborated retrieval practice. With regard to initial test
performance, the proportion correct transfer effect size increased (∆d) by an estimated 0.0058 for
every increment of 0.01 in initial test proportion correct. Across the full range of observed
initial test proportion correct in the sample (0.19 to 0.98), the total ∆d was 0.46.
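For concreteness, the total change just reported follows directly from the per-unit estimate:

\Delta d = 0.0058 \times \frac{0.98 - 0.19}{0.01} = 0.0058 \times 79 \approx 0.46 .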
Analyses adjusted for publication bias. Results of PEESE analyses for effect sizes with
initial test performance data are listed in Table 4. In the analyses with moderators fitted, sv was
again a highly significant moderator (p = .0002), suggesting substantial publication bias. The
intercept in this case was negative, suggesting negative transfer in the absence of response
congruency and elaborated retrieval practice, and when proportion correct on the initial test is
zero. The moderator effect sizes were again analogous to those of the random-effects analyses;
the parameter estimates for response congruency and initial test accuracy (∆d = 0.25 and 0.50,
respectively) were only marginally smaller than in the random-effects analysis (compare to Table
3) and they remained highly significant. The estimate for elaborated retrieval practice was
Results of the sensitivity analyses are listed in Table 5. Again, the moderator effect sizes
were not reduced under different scenarios of publication bias. As with the sensitivity analyses
to the overall dataset, the estimated intercept was more variable under the different scenarios,
especially for the case of one-tailed publication bias; the intercept for the case of moderate bias
(d = −0.26) most closely resembles the intercept in the PEESE analyses (d = −0.30) of the initial
test performance dataset. This result reinforces our earlier inference of at least moderate one-tailed publication bias.
Category-Level Meta-Analyses
Results of category-level single and simultaneous moderator fits are described as follows.
Test format. A random-effects meta-analysis on the 56 effect sizes from the 29 included studies in the test format category (comprising six transfer sub-categories involving
different combinations of initial and final test formats) yielded a weighted mean effect size of d =
0.58 (a medium-large effect), p < .00001, C.I. [0.43, 0.73]. This result confirms the conclusion
in the literature of positive transfer across test formats (that is also evident in the forest plot in
Figure 2, panel a). Heterogeneity remained, however, at both the between paper (τ2 = 0.056)
Single and simultaneous moderator fits. We first tested the following five candidate
moderators using random-effects meta-analysis: between- vs. within-subjects design, correct
answer feedback, response congruency, number of training phase item repetitions, and retention
interval. Additionally, prompted by hypotheses in the literature about the relative effectiveness
of different initial test formats (e.g., McDaniel, McDermott, et al., 2007), we examined two
dichotomous candidate moderators involving initial test format: free recall vs. not free recall on
the initial test and multiple-choice vs. not multiple-choice on the initial test. There were too few
studies involving elaborated retrieval practice for that candidate moderator to be analyzed. In the
single moderator fits (Table 6), only the between- vs. within-subjects moderator was significant.
For within-subjects designs, the weighted effect size was d = 0.49, p < .00001, C.I. [0.33, 0.66];
the use of a between-subjects design increased the estimated effect size by d = 0.35, p = .0062,
C.I. [0.12, 0.58], yielding an estimated effect size of d = 0.84. In the simultaneous fits (see Table
6), the between- vs. within-subjects moderator was again the only significant moderator.
We next performed meta-analyses on the 46 (out of 56) effect sizes in the test format
category for which initial test performance data were available (Table 6). When initial test
performance was tested as a single candidate moderator, it was not significant on its own (p =
.12). However, when initial test performance was simultaneously fit to the data along with the
aforementioned seven candidate moderators, it was significant along with the between- vs.
within-subjects design, response congruency, and multiple-choice vs. not multiple-choice on the
initial test moderators (ps ≤ .048). Between-paper heterogeneity was reduced to zero in that final
model (τ2 = zero), whereas within-paper heterogeneity remained high (ω2 = 0.11). The findings
for initial test performance and response congruency were also consistent with patterns observed in the overall meta-analyses.
Analyses adjusted for publication bias. In PEESE analyses to the entire test format
category dataset with moderators fitted (Table 7; results of analyses with no moderators also
shown), sv was not a significant moderator of effect size (p = .29). Moreover, the intercept effect
size (d = 0.39) was similar to that in the corresponding random-effects analysis (d = 0.49). The
between- vs. within-subjects design moderator effect also remained positive but was no longer significant.
In PEESE analyses with moderators limited to data for which initial test performance was
reported, sv was a significant moderator of effect size (p = .025), suggesting publication bias.
However, the moderators of response congruency, initial test accuracy, and multiple choice vs.
not on the initial test (that were identified in the random-effects analysis) all remained potent.
The between- vs. within-subjects design moderator was again not significant. Because that
moderator did not emerge from the overall analysis and was not identified in any other category
analyses, we infer that it has a weak effect on transfer at best (in contrast to its apparent influence in the unadjusted random-effects analyses of this category).
In sensitivity analyses to the entire test format category dataset, the intercept estimate was
generally unaffected by the different scenarios of publication bias (Table 8). The sole exception
was the case of severe one-tailed publication bias, in which the intercept was reduced by more
than half but remained positive. Similarly, sensitivity analyses limited to data for which initial
test performance was reported revealed only modest changes to effect size estimates for both the
moderators and the intercept, again excepting the case of severe one-tailed publication bias.
Stimulus-response rearrangement. A random-effects meta-analysis of the effect sizes
from all 10 studies in the stimulus-response rearrangement category (including four sub-
categories of stimulus types: paired associates, triple associates, multi-term facts, and term-
definition facts) with no candidate moderators fitted yielded a weighted mean effect size of d =
0.22 (a small effect), p = .066, C.I. [−0.019, 0.45], suggesting weak transfer at best. Between-
paper heterogeneity was again observed, however (τ2 = 0.071, ω2 = zero). Inspection of the
forest plot in panel b of Figure 2 reveals a more nuanced pattern: positive transfer for paired
associates and minimal transfer elsewhere (as observed by Pan & Rickard, 2017).
Single and simultaneous moderator fits. We tested three candidate moderators using
random-effects meta-analysis, including retention interval and paired associates vs. non-paired associates. The values of other
aforementioned candidate moderators exhibited minimal variability across experiments and were
not analyzed. In the single moderator fits (Table 6), only the paired associates vs. non-paired
associates moderator was significant. For non-paired associate stimuli, the weighted effect size
was negligible, d = 0.063, p = .29, C.I. [−0.078, 0.20]; for paired associates, the estimated effect
size increased by d = 0.66, p = .0031, C.I. [0.46, 0.86], yielding a large estimated effect size of d
= 0.72. Although the p-value for paired associates should be treated with some caution given
insufficient degrees of freedom for that category at the paper level, its small value, in
combination with consistent results over studies (see Figure 2, panel b), gives us high subjective
confidence in the conclusion of strong transfer for that case. When all three candidate
moderators were evaluated simultaneously (see Table 6), only the paired associates vs. non-
paired associates moderator survived. Heterogeneity was reduced to near zero in that final model.
Analyses adjusted for publication bias. In PEESE analyses of the stimulus-response
rearrangement category that included the paired associates vs. non-paired associates moderator
(Table 7), sv was marginally significant (p = .064). The intercept estimate of roughly zero for
non-paired associate stimuli corresponds closely to the random-effects analysis, as does the
estimated effect size increment of d = 0.59 for paired associate stimuli. In the sensitivity
analyses, the effect size estimate for the paired associate vs. non-paired associate moderator, as
well as the intercept, exhibited only modest fluctuations under the four scenarios of publication bias.
Untested materials seen during initial study. A random-effects meta-analysis of the
effect sizes from the 12 studies in the untested materials category (with no candidate moderators
fitted) yielded a weighted mean effect size of d = 0.16, p = .20, C.I. [−0.10, 0.43]. Thus, there is
no compelling evidence of transfer for this category. There was however relatively high
between-paper heterogeneity (τ2 = 0.11), but zero within-paper heterogeneity. The heterogeneity
between papers is evident upon inspection of the forest plot in panel b of Figure 2.
Single and simultaneous moderator fits. We tested the following five candidate moderators using random-effects meta-analysis: between- vs. within-subjects
design, number of training phase item repetitions, correct answer feedback, retention interval,
and elaborated retrieval practice. In the single moderator fits (Table 6), only the elaborated
retrieval practice moderator was significant, d = 0.37, p = .032, C.I. [0.041, 0.70]; the use of
such strategies increased the estimated effect size from d = 0.0028 to 0.37. Between-paper
heterogeneity was modestly reduced in that moderator fit (τ2 = 0.077, ω2 = zero). The results for
that moderator reflect corresponding findings in the overall meta-analyses, as well as prior
suggestions in the literature for this transfer category (e.g., Balch, 1998; Chan et al., 2006; Little,
2011; Hinze et al., 2013; see also Chan, 2009; 2010; Little & Bjork, 2015; Little, Bjork, Bjork, &
Angello, 2012). When all five candidate moderators were evaluated simultaneously (see Table 6), only the elaborated retrieval practice moderator remained significant.
Analyses adjusted for publication bias. In PEESE analyses that included the elaborated
retrieval practice moderator (Table 7), sv was a significant moderator (p = .028), suggesting
publication bias in the untested materials category. The adjusted intercept estimate was negative
(d = −0.32), suggesting that transfer to untested materials is worse in the testing than non-testing
reexposure control conditions (in contrast, in the random-effects analyses, the intercept estimate
was essentially zero). The estimated effect size increase for elaborated retrieval practice (d =
0.34) was significant and, in contrast with nearly every other PEESE analysis performed, larger
than in the random-effects model. However, given the negative intercept estimate of similar
magnitude, there is by this analysis no evidence of positive transfer in this category relative to a
non-testing reexposure control, even when elaborated retrieval practice is present. It thus
appears that testing with elaborated retrieval practice yields learning in this category that is, at best, comparable to that in the non-testing reexposure control condition.
The sensitivity analyses (Table 8) converge with the conclusions based on the PEESE
analysis. The effect size estimate for elaborated retrieval practice was minimally affected by the different scenarios of publication bias.
Application and inference questions. A random-effects meta-analysis of the effect sizes from
the 17 papers in the application and inference category (including three sub-categories:
application questions, inference questions, or both) with no candidate moderators fitted yielded a
weighted mean effect size of d = 0.32, p = .0013, C.I. [0.085, 0.56], indicating overall positive
transfer. However, there was substantial between-paper (τ2 = 0.11) but zero within-paper
heterogeneity.
Single and simultaneous moderator fits. We tested the following five candidate
moderators using random-effects meta-analysis, the values of which varied sufficiently over
experiments: between- vs. within-subjects design, number of training phase item repetitions, retention interval, correct answer feedback,
and elaborated retrieval practice. No sub-category comparisons (e.g., application vs. inference
questions) were performed due to the limited number of papers in each sub-category; moreover,
visual inspection of the category-level forest plot suggests that the variation in effect sizes in this
category is not attributable to sub-category. In the single moderator fits (Table 6), the number of
training phase item repetitions, retention interval, and elaborated retrieval practice moderators
were significant (ps ≤ .029). An increase in training repetitions or retention interval was
associated with improved transfer (estimated increases of d = 0.33, p = .016, C.I. [0.11, 0.55] and
0.0033, C.I. [0.0010, 0.0054] for each additional repetition and added hour, respectively), as was
the use of elaborated retrieval practice (estimated increase of d = 0.35, p = .029, C.I. [0.046,
0.66]). When all five moderators were evaluated simultaneously (see Table 6), the correct
answer feedback moderator was significant (p = .011), and the number of training phase
repetitions and elaborated retrieval practice moderators were marginally significant (ps ≤ .063).
However, the p-value for correct answer feedback and the number of training phase repetitions
should be treated with caution due to insufficient degrees of freedom. Correct answer feedback
was associated with less transfer (estimated decrease of d = −0.49, C.I. [−0.75, −0.24]),
yielding an estimated effect size of d = −0.049, whereas elaborated retrieval practice was
associated with the reverse (estimated increase of d = 0.26, C.I. [−0.019, 0.54]), yielding an
estimated effect size of d = 0.70. Between-paper heterogeneity was substantially reduced in that final model.
Analyses adjusted for publication bias. In the PEESE analyses of the application and
inference category dataset with moderators fitted (Table 7), all three of the moderating variables
that were identified in the random-effects analysis survived with similar effect size estimates.
Although there was a trend toward publication bias as indicated by the estimated coefficient for
sv, it did not reach statistical significance (p = .11). In sensitivity analyses to the same dataset
(Table 8), the effect size estimates for all moderators and the intercept were minimally affected by the different scenarios of publication bias.
Problem-solving skills. A random-effects meta-analysis of the effect sizes from the
papers that comprise the problem-solving skills category (including two sub-categories: medical
diagnosis and treatment and worked examples) with no candidate moderators fitted yielded a
weighted mean effect size of d = 0.29, p = .10, C.I. [−0.078, 0.65], indicating weak transfer.
However, as is confirmed upon inspection of the forest plot in Figure 2, panel d, as well as by the
difference between the results for the medical diagnosis and treatment sub-category and those for
the worked examples sub-category, that overall estimate masks substantial differences between the two sub-categories.
Single and simultaneous moderator fits. The difference between the two sub-categories
was borne out by a significant effect of worked examples vs. medical diagnosis and treatment as a single moderator (Table 6), d = 0.59, p = .028,
C.I. [0.093, 1.09]. The use of problem types which involve medical diagnosis and treatment
increased the estimated effect size from d = 0.045 to d = 0.59. Between-paper heterogeneity
was substantially reduced in that moderator fit (τ2 = 0.047, ω2 = zero). None of four other
candidate moderators fitted in this case (between- vs. within-subjects design, correct answer
feedback, number of training phase item repetitions, and retention interval) approached statistical
significance. When all five candidate moderators were fitted simultaneously (see Table 6), again
only the worked examples vs. medical diagnosis and treatment moderator was significant.
Analyses adjusted for publication bias. In PEESE analyses of the problem-solving skills
category dataset which included the worked examples vs. medical diagnosis and treatment
moderator (Table 7), the results were a close match to those of the random-effects analysis. In
sensitivity analyses of the problem-solving skills dataset (Table 8), the effect size estimate for the worked examples vs. medical diagnosis
and treatment moderator was largely unaffected except for the case of severe one-tailed
publication bias, and the intercept was near zero in all cases.
Mediator and related word cues. A meta-analysis on 27 effect sizes from the five
papers that comprise the mediator and related word cues category yielded a weighted mean effect
size of d = 0.61 (a medium-large effect), p = .018, C.I. [0.25, 0.97], indicating positive transfer
(although the p-value should be treated with caution due to insufficient degrees of freedom at the
paper level). Although inferentially valid moderator fits were not possible due to insufficient
data, we were able to estimate separate weighted mean effect sizes for the two cue types:
mediator cues (d = 0.76); related cues (d = 0.47). That numerical difference is consistent with
paper-level statistical results (e.g., Carpenter, 2011) and is evident upon inspection of panel d of
Figure 2.
Analyses adjusted for publication bias. In PEESE analyses to the mediator and related
word cues dataset (no moderators were analyzed in this category; see Table 7), sv was marginally
significant (p = .070), suggesting possible publication bias, and the adjusted intercept (d = 0.36)
was reduced relative to that of the random-effects analysis. In sensitivity analyses of the same
dataset, the intercept was minimally affected by the different scenarios of publication bias.
In the test-enhanced learning literature, differing predictions about the extent to which
transfer effects may be smaller or larger than testing effects have been made (e.g., Carpenter &
DeLosh, 2006; McDaniel, Anderson, et al., 2007; Rohrer et al., 2010). It has also been an open
question in the literature as to whether testing and transfer effects are correlated with one
another. To address both issues, we plotted testing and transfer effect size data from all 81
experiments in our sample that assessed both effects on the final test (approximately 40% of
effect sizes in our dataset, encompassing five categories, had such information). The mean
testing effect size in our dataset, d = 0.68, is roughly comparable to that observed in prior meta-
analyses (e.g., weighted mean effect sizes of g = 0.70 in Adesope et al., 2017, and g = 0.50 in
Rowland, 2014), although it is important to note that those meta-analyses involved different sets of included studies.
Results for testing vs. transfer effects are shown in Figure 3. Most data points are below
the diagonal (dotted line) that corresponds to equivalent effect sizes (56 of 81 cases; binomial
test: p = .00070). Hence, on average, transfer effects are smaller than testing effects, a pattern
that may generally be the case in this literature. The only exception was a non-significant trend in the opposite direction in one category.
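Assuming the reported value reflects a two-sided exact binomial test of 56 below-diagonal outcomes out of 81 against chance (.5), it can be reproduced in one line:

```python
from scipy import stats

# 56 of 81 experiments showed a smaller transfer than testing effect.
print(stats.binomtest(56, n=81, p=0.5).pvalue)   # approximately .0007
```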
An additional and unexpected result evident in Figure 3 is that, in the overall data set,
testing and transfer effect sizes are at best weakly correlated. However, that result may mask
contrasting patterns within category type. We will return to that topic in the Discussion.
Discussion
The preceding analyses constitute the first comprehensive meta-analytic review of the literature on transfer
of test-enhanced learning. In aggregate across that literature, there is substantial evidence that
testing can yield positive transfer relative to non-testing reexposure control conditions such as
restudy and rereading. When considering the fact that transfer is often notoriously difficult to
achieve (Gick & Holyoak, 1987; Haskell, 2001; Singley & Anderson, 1989), and that its very
existence has been debated (e.g., Barnett & Ceci, 2002; Detterman, 1993; Singley & Anderson,
1989), that finding is in itself notable. However, in some categories and sub-categories, weak,
null, or even negative transfer was observed, and particularly in analyses that adjust for
publication bias. Overall across the literature, positive transfer of test-enhanced learning appears
to be strongly conditional on key aspects of performance and experiment design – a finding that has both theoretical and practical implications.
Drawing upon the three major moderators uncovered in the overall meta-analyses, we
propose a three-factor framework for transfer of test-enhanced learning, in which the likelihood of positive
transfer manifests as a function of whether there is response congruency between the initial and
final tests, whether elaborated retrieval practice (i.e., broad encoding methods and/or
elaborative feedback) is employed during training, and whether the level of initial test
performance is high or low. The predictions of that framework for the overall dataset wherein
initial test accuracy was reported are depicted graphically in Figure 4, panel a, where the
estimated moderator effects are from the random-effects meta-analysis (PEESE moderator
results, shown in panel b, were roughly equivalent, although the no-moderator intercept was substantially lower).
With the three moderators fitted, between-paper heterogeneity in the random-effects model was
reduced nearly three-fold, from τ2 = 0.077 to τ2 = 0.028. It thus appears that the major
determinants of transfer across papers are captured by that model. In contrast, within-paper
heterogeneity was virtually unchanged after the model was fitted (ω2 = 0.061). However, that
effect appears to be exclusively driven by the test format category. In the previously reported
random-effects analysis limited to that category, within-paper heterogeneity was very large (ω2 =
0.11), even after moderators were identified; correspondingly, when that category was removed from the analysis, within-paper heterogeneity dropped substantially.4

4 Inspection of details of the articles in the test format category appears to explain the high value of ω2. Multiple
papers examined variations within a single test format across multiple experiments (e.g., varying the difficulty of the
cues presented, as in Halamish & Bjork, 2011; Rowland & DeLosh, 2016), as well as other differences in training
phase design (e.g., mixed vs. pure lists in Rowland et al., 2014; different sequences of study and test trials in Jacoby
et al., 2010). Those variables could not be addressed in the present meta-analyses.

Under the conditions of no response congruency, no elaborated retrieval practice, and low
initial test proportion correct (i.e., by subtracting 0.19, the lowest value of initial test
proportion correct in the dataset, from every initial test score, yielding a new intercept), the estimated
transfer effect size in the random-effects analysis is slightly negative, d = −0.053, C.I. [−0.22, 0.12],
with a confidence interval that at its upper extreme allows only a small positive transfer effect.
In contrast, with both response congruency and the use of elaborated retrieval practice present in
that same model, and with the initial test accuracy intercept set to the maximum value in the
dataset (0.98), the predicted effect size is d = 0.90, C.I. [0.71, 1.08]. These two contrasting cases illustrate the wide range of transfer outcomes predicted by the three-factor framework.
Similar results were obtained in the PEESE analyses, although with notably reduced
effect sizes. Under the conditions of no response congruency, no elaborated retrieval practice,
and low initial test performance, the estimated transfer effect size is negative (d = −0.21). At the
opposite extreme, with response congruency and elaborated retrieval practice present, and initial
test accuracy at the maximum value in the dataset, the predicted effect size is d = 0.58.
In the next sections we consider each of the moderating factors in the three-factor transfer framework in turn.
Response congruency. If strong response congruency holds in a given study, then there
is a greater likelihood of positive transfer of test-enhanced learning. That finding is broadly
consistent with identical elements and other similarity-based models of transfer (e.g., Thorndike,
1906; for related discussions see Hamaker, 1986; Morris et al., 1977; Roediger & Blaxton, 1987;
Tulving, 1970, 1984). However, its precise definition differs from those used in other accounts
of similarity in which response congruency is not an explicit focus, and which refer more broadly
(and often far less precisely) to semantic or other processing similarities between the training and
transfer contexts (e.g., Anderson, 1993; Bruce, 1933; Haskell, 2001; Morris et al., 1977), or
which propose different mechanisms (e.g., Healy, Wohldmann, & Bourne, 2005). One
prominent exception is Wylie (1919), who proposed that transfer is determined by the “objective
Indeed, the response congruency effect appears in some cases to be in direct contradiction
of the predictions of other similarity-based frameworks. This is illustrated by the following case
of stimulus-response rearrangement involving facts (from Pan, Gopal, et al., 2015): the initial
test involved the question, “Thomas Jefferson purchased Louisiana from WHOM?” and a
subsequent transfer test involved the question, “Thomas Jefferson purchased WHAT from
Spain?” At both surface and semantic levels, there appears to be strong similarity between those
questions, and thus most similarity-based transfer frameworks would appear to predict positive
transfer; however, there was little evidence of transfer in that case. Further evidence regarding
response congruency comes from inspection of differences in the correlations between testing and transfer effects for categories
that did or did not have that property (see Figure 3). Response congruency held for all studies in
the mediator and related word cues categories and for most cases in the test format category, and
across those two cases there were indications of a positive correlation between testing and
transfer effects. In contrast, in the three other categories wherein response congruency generally did not hold, there was no indication of such a correlation.
Mechanisms of positive transfer via response congruency. There are two theoretical
reasons to expect that response congruency may facilitate transfer. First, if a correct response on
the final test was also retrieved or provided as correct answer feedback on the initial test, then
that response may be more available (i.e., more easily accessible) on the final test than would
otherwise be the case (Bjork & Bjork, 1992; Carrier & Pashler, 1991; Vaughn & Rawson, 2014).
A clear case in which that effect may be at play is the presentation of related cues on the final
test for the same responses that were retrieved on the initial test (e.g., Rawson et al., 2015).
Related cues are by design only weakly related to the correct final test response, and thus may be
unlikely to facilitate retrieval of the correct response by themselves. However, if the correct
response was made more available by the initial test, then the joint factors of that increased
availability and the weak association with the related cue may boost final test performance above the level that would otherwise be achieved.
Second, in many of the cases in which response congruency holds, it is also the case that
all or part of the stimulus-to-response pathway that was established on the initial test (at least for
correct trials) can be reinstated to support retrieval of the correct answer on the final test. Four
such cases of stimulus-to-response pathway reinstatement are depicted in Figure 5 (all panels).
Consider first the case of transfer from an initial multiple-choice test to a final cued recall test
(i.e., across test formats), in which the correct responses on the initial and final tests are the
same, and excepting a format change, the initial and final test cues are as well (e.g., Meyer &
Logan, 2013). As shown in panel a, the originally learned stimulus-to-response pathway can be
reinstated (i.e., by the transfer stimulus, which differs from the initial test stimulus only by a
change in the presented test format) to retrieve the correct response on the final test, thus
yielding direct transfer of test-enhanced learning. The same can apply for the reverse case (i.e.,
cued recall to multiple-choice). Another plausible case involves mediator word cues on the final
test (e.g., the aforementioned example of father as a mediator for mother-child). Two scenarios
of stimulus-to-response pathway reinstatement for that case have been advanced in the literature
(Cho et al., 2017; Coppens et al., 2016). In the first, depicted in panel b, the mediator word cue accesses a mediator-to-target pathway (e.g., father-child) that is presumed to have been formed or
strengthened on the initial test. In the second, depicted in panel c, the mediator word cue (e.g.,
father) activates the initial test word cue (e.g., mother), thus reinstating the full stimulus-to-
response pathway (e.g., father elicits mother, which then elicits the target, child) that was established on the initial test. Stimulus-to-response pathway reinstatement is also plausible in the two problem-solving sub-categories: resuscitation skills training and problems involving medical diagnosis and treatment. In the former case (e.g., Kromann et al., 2009),
which corresponds to panel a of Figure 5, the patient demographics and/or symptoms on the
initial and final tests may differ slightly, but the overall scenarios and procedures (i.e., a cardiac
resuscitation checklist) are nearly or completely identical. In that case, the cues on the final test
likely can support reinstatement of the full stimulus-to-response pathway. In the latter case (e.g.,
Larsen et al., 2013a; Larsen et al., 2013b), which corresponds to panel d, the stimuli on the
transfer test partially overlap with (or are a subset of) those on the initial test. For example, after
training to recognize and treat different neurological conditions, subjects are presented with
patient scenarios which correspond to those conditions on a final test. In that case, partial
stimulus-to-response pathway reinstatement is likely to occur (i.e., subjects are able to link the
cues that are presented on the transfer test to previously trained symptoms and procedures).
Elaborated retrieval practice. The use of elaborated retrieval practice increases the
likelihood of positive transfer. We have suggested that elaborated retrieval practice takes two
distinct and often non-overlapping forms: broad encoding methods and elaborative feedback.
Regarding the former, training methods such as broad retrieval instructions, explanatory recall, and the combination of high and low order questions can be
efficacious at yielding transfer. Indeed, all nine effect sizes across five meta-analyzed studies in
four categories that involved broad encoding methods yielded statistically significant positive
transfer effects (Chan et al., 2006; Hinze et al., 2013; Larsen et al., 2013b; Little, 2011;
McDaniel, Bugg, et al., 2015). Broad encoding methods may be especially critical for yielding
transfer to untested materials seen during initial study, as such methods likely result in additional
processing of those materials (Anderson & Biddle, 1975; Little, 2011). For instance, in Chan et
al. (2006; Experiment 3), subjects in the broad retrieval condition were instructed to think of all
related information while generating answers to initial test questions; positive transfer to untested
materials was subsequently observed in that condition, but not in other conditions which lacked
such instructions. One candidate underlying mechanism for transfer via broad encoding methods
is greater cognitive processing of initially studied materials (for discussions see Carpenter &
DeLosh, 2006; Chan et al., 2006; Frase, 1968). That added processing may stem from
reactivation, or reminding, of memories formed during initial study that are not limited to the
correct answer but may also involve other aspects of the studied materials (for further discussion
of the effects of reminding, see Jacoby, Wahlheim, & Kelley, 2015; Tullis, Benjamin, & Ross,
2014). Alternatively, more indirect processes (e.g., discrimination between answer choices as
discussed by Little, 2011; improved construction of mental models as suggested by Hinze et al., 2013) may also play a role.
It is important to reemphasize here that only four types of training methods qualified as
broad encoding methods in the meta-analyses, and together these studies only comprised a small
proportion of the literature. The definition of broad encoding methods is open to expansion or
reinterpretation, and the effectiveness of other techniques that might qualify remains to be
explored. Indeed, some training techniques that might have been expected to yield similar
effects evidently do not; for example, in the case of stimulus-response rearrangement, even the
retrieval of two terms or words per fact or word triplet across separate training trials has been
shown to yield no positive transfer to untested responses (e.g., Pan, Wong, et al., Experiment 2),
a result that is, however, fully consistent with the response congruency factor. In another
example, the use of multiple rephrased initial test questions does not necessarily yield better
transfer to application and inference questions than do repeated presentations of identical questions (Butler, 2010). It may be the case that broad encoding methods have to foster or
improve the retrieval of associations between multiple pieces of information acquired during
initial study (and not just isolated pieces of information) in order to improve transfer.
The second form of elaborated retrieval practice, elaborative feedback (which can include restudy of all to-be-learned information, explanatory feedback, or extended and detailed feedback), can generate positive transfer. In fact, positive transfer was observed for 36 of 40 cases featuring elaborative feedback, drawn from 18 studies spanning five transfer categories. That
finding supports prior claims regarding the role of the indirect effects of testing (i.e., activities associated with, but not directly involved in, the act of testing itself, such as improved restudy; for further discussion see Roediger & Karpicke, 2006) in transfer performance (e.g., Balch, 1998; McDaniel, Howard,
et al., 2009; McDaniel & Little, in press; Nguyen & McDaniel, 2016; Pan, Gopal, et al., 2015;
Pan, Wong, et al., 2016; Pan & Rickard, 2017; van Eersel et al., 2016; and others).
By contrast, simple correct answer feedback was not associated with improved transfer
performance in any of the meta-analyses. Thus, it appears that, to reliably enhance transfer,
feedback must include more than just the correct response. This stands in contrast with the
results for the larger test-enhanced learning literature, in which correct answer feedback by itself is commonly beneficial for retention of the practiced material.
Initial test performance. The retrieval success rate on the initial test also substantially
predicts the magnitude of transfer in this literature. That finding is consistent with a positive
correlation in the dataset between initial test performance and transfer effect size (which is
shown in Figure 6). A candidate account of that effect is that high accuracy on the initial test
reflects not only better memory for the target information that is tested, but also more complete
memory for other aspects of the study event, including any inferences or other thoughts that
occurred during the initial study phase that may be relevant for the transfer task. By this
account, when initial test accuracy is high, those other memory aspects are relatively likely to
have been retrieved along with the correct answer on the initial test, yielding (via test-enhanced
learning) relatively high probability of retrieval of those memory aspects on a final transfer test
(i.e., positive transfer relative to a non-testing reexposure control condition). In contrast, low
initial test accuracy likely correlates with partial, or piecewise retrieval of the study event (i.e.,
when overall retrieval performance for the target information is poor, those memories are
presumably also less complete). When initial test performance is low, the retrieved pieces of the memory sometimes include the correct answer, but more often they do not. In either
case, the probability that aspects of the initial study event that may be relevant for transfer will
be retrieved on the initial test is expected to be lower than would be the case when initial test
accuracy is high, leading to poor transfer. This account, along with other candidate accounts of the initial test performance effect, merits further investigation.
In every case wherein a moderator in the three-factor transfer framework could be analyzed via simultaneous
moderator fits at the category level, it emerged as statistically significant in both the random-
effects and PEESE analyses. There were four such cases: response congruency and initial test
performance in the test format category, and elaborated retrieval practice in the untested
materials and application and inference categories (in other categories, such analyses were
precluded by the absence of within-category moderator variability). These results reinforce the generality of the three-factor transfer framework.
Category-Specific Moderators
There were six cases across four categories in which a moderator was identified for only a single category; we consider each in turn.
Test format. The multiple-choice vs. not on the initial test moderator emerged as
significant for this category in both the random-effects and PEESE analyses, with reduced
transfer for the multiple-choice case. That result is consistent with prior hypotheses in the
literature (e.g., Duchastel, 1981; Foos & Fisher, 1988; McDaniel, Anderson, et al., 2007;
McDaniel, McDermott, et al., 2007; McDermott et al., 2014; Rickard & Pan, 2017; Wenger, Thompson, & Bartling, 1980). Stimulus-response rearrangement. The word pairs vs. facts and word triplets moderator was significant in this category in both the random-effects and PEESE analyses. That
finding is consistent with prior observations on the different transfer properties of stimuli with
two vs. three or more elements (e.g., Pan & Rickard, 2017). The basis for that contrasting effect
for superficially similar materials is investigated further in the context of the dual memory model (see Rickard & Pan, 2017). Application and inference questions. In this category, by contrast, elaborated retrieval practice, which most commonly took the form of
elaborative feedback, was a significant moderator in the opposite direction (as occurred in the
overall meta-analyses). It thus appears that feedback which involves more than just the correct answer (i.e., elaborative feedback; for examples see Blunt & Karpicke, 2014, Experiment 2; Eglington & Kang, 2016;
Karpicke & Blunt, 2011, Experiment 1; McDaniel, Howard, et al., 2009, Experiment 2; and
Zhou et al., 2013, Experiment 2) is needed for feedback to improve transfer in this category.
When just the correct answer is provided, transfer may not manifest or may even be negative (for
examples see Agarwal, 2011, Experiment 1; Nguyen, Gouravajhala, & McDaniel, 2016; Tran, Rohrer, & Pashler, 2015).
Two other moderators, number of training phase repetitions and retention interval, were
also significant in this category only. It is not clear, however, why both of these moderators
would be predictive only for the case of application and inference questions and not in the
overall analyses (as substantial variability in both moderators is present across multiple
categories). In light of the expectation that Type I errors may occur across the large number of
tested moderators at the category level, results for these moderators should be regarded as
tentative.
Problem-solving skills. The significance of the worked examples vs. medical diagnosis
and treatment moderator in this category reflects the starkly different results for its two sub-categories, which differ in terms of (a) content (i.e., the types of problems that are trained and tested), (b) training methods (i.e., problem worksheets without feedback vs. a range of testing and feedback
methods), (c) non-testing reexposure control conditions (i.e., worked example study vs. text
restudy), (d) settings (i.e., laboratory vs. clinical), and (e) problem-solving procedures. Due to
the limited data currently available, it is not possible to fully disentangle these content, design,
and procedural differences (which may conflate content with processes; despite that possibility,
the two sub-categories accurately characterize this category as it currently exists). It is however
plausible that testing might yield different degrees of transfer across different problem types.
For new results that address the general absence of feedback in the worked examples sub-
category and other implementation factors, see Yeo and Fazio (in press).
Publication Bias
PEESE analyses of the overall dataset indicated at least moderate publication bias in the
transfer of test-enhanced learning literature. Nevertheless, in all cases the moderator variables
that were identified in the random-effects analysis remained significant, with modestly reduced
effect estimates (see Figure 7 for all effect sizes plotted against sv, in effect a funnel plot turned
sideways, along with PEESE moderator estimates). The net result was that, when none of the
moderators took values that enhanced transfer, the predicted transfer effect was null or negative.
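For readers less familiar with PET-PEESE, the following minimal sketch (in Python, using made-up effect sizes and variances rather than the actual dataset) illustrates the core PEESE computation: effect sizes are regressed on their sampling variances with inverse-variance weights, and the intercept serves as the estimate adjusted for small-study effects. In the analyses reported here, the moderator variables were included in the same regression, so the sv slope and the moderator coefficients were estimated simultaneously.

import numpy as np

# Minimal PEESE sketch: regress each effect size d_i on its sampling variance v_i
# (the "sv" term), weighting by inverse variance. The intercept approximates the
# effect size of a hypothetical study with zero sampling error; a positive slope
# on v_i indicates small-study effects consistent with publication bias.
# The data below are invented for illustration only.

d = np.array([0.15, 0.35, 0.55, 0.70, 0.90])   # observed effect sizes (Cohen's d)
v = np.array([0.01, 0.04, 0.09, 0.16, 0.25])   # sampling variances of those effects

w = 1.0 / v                                     # inverse-variance weights
X = np.column_stack([np.ones_like(v), v])       # design matrix: intercept + variance

# Weighted least squares: solve (X' W X) beta = X' W d
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ d)

intercept, slope = beta
print(f"PEESE intercept (bias-adjusted d): {intercept:.2f}")
print(f"PEESE slope on sampling variance:  {slope:.2f}")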
At the category level, the PEESE analyses with moderators fitted produced statistically
significant evidence for publication bias only in the test format and untested materials categories.
That result may appear to contradict the substantial bias detected in the overall analysis. On
closer examination, however, there was a trend toward publication bias for every category except
for problem-solving skills. Further, the estimated slopes of the sv parameter for those categories
(in PEESE analyses with moderators listed in Table 7) are 3.19, 6.23, 4.35, 5.68, and 1.94, which
compare favorably to the sv estimate of 3.86 in the overall PEESE analyses. That result is
consistent with the conclusion that publication bias is pervasive across most categories in this
literature but was not detected within most categories due to the smaller sample size and lower
statistical power. Finally, both overall and across categories, the sensitivity analyses indicated
that a moderate level of publication bias would decrease intercept effect sizes by an amount comparable to the PEESE-based adjustments. Given the correlational nature of the moderator analyses, causal conclusions regarding the significant moderators should be tentative. Nevertheless, it is notable that each of the major
moderators in the three-factor framework lends itself to plausible and relatively straightforward
causal interpretation. Moreover, in most cases they are consistent with prior empirical or
theoretical work. First, although response congruency has not specifically been considered in the
current test-enhanced learning literature as a factor that may moderate transfer (indeed, it appears
not to have been prominently considered for a wide range of transfer contexts since Osgood,
1949), it is sensible and generally consistent with principles of learning that it would. Moreover,
there is independent evidence that making a response increases its subsequent retrieval
availability (e.g., Estes, 1979; for review see Vaughn & Rawson, 2014), and the aforementioned stimulus-to-response pathway reinstatement account is consistent with multiple theoretical frameworks of memory (e.g., Healy et al., 2005). Second,
the multiple forms of elaborated retrieval practice identified in this review have all been
hypothesized to yield transfer in the prior literature, with some prior experiment-level support
(e.g., Chan et al., 2006; Hinze et al., 2013; McDaniel, Bugg, et al., 2015). Finally, a higher level
of initial test performance may reflect better and broader learning of target materials (as we discussed above).
The meta-analyses are also potentially limited by the simplifying assumptions that were
made during the calculation of between- and within-subjects effect sizes, as previously detailed.
Accordingly, although between- vs. within-subjects design did not emerge as a significant
predictor in the overall meta-analyses and did not survive the simultaneous model fits of the
PEESE analyses, our conclusions regarding that moderator should be regarded as tentative.
Moreover, with regard to the moderators identified in this review, there are always the possibilities that some candidate moderators were not detected due to insufficient power, and that some detected moderators reflect Type I errors. Further, in several categories (e.g., mediator and related word cues, as well as problem-solving skills) there were only a small number of studies and effect sizes. Thus, although the estimates of aggregate effect size at the category level should generally be trustworthy, inferences about candidate moderating factors that could be examined only within those categories warrant caution.
As detailed at the outset of this review, few theoretical accounts of test-enhanced learning
directly address transfer. Moreover, the wide range of qualitatively different learning and
transfer contexts compounds the challenge of adapting existing theoretical accounts to address
them all. However, the constituent components of the three-factor transfer framework – which
identify important encoding, retrieval, and design factors for transfer in this literature – can be
integrated within existing theoretical perspectives. First, the effect of response congruency is
most readily accounted for by theories of test-enhanced learning that are framed, at least in part,
in terms of associative memory, such that the concepts of response availability and stimulus-to-
response pathway reinstatement can be readily incorporated. Examples include the elaborative
retrieval (Carpenter & DeLosh, 2006), mediator effectiveness (Pyc & Rawson, 2009), and dual
memory (Rickard & Pan, 2017) theories. Second, with regard to elaborated retrieval practice,
the case of positive transfer where broad encoding methods are used is potentially consistent
with accounts which specifically reference a test-induced process of spreading activation (e.g.,
Carpenter, 2009; Chan et al., 2006; Pyc & Rawson, 2009). Additionally, the effectiveness of
elaborative feedback for transfer is consistent with qualitative accounts of test-enhanced learning
which focus on its indirect effects (e.g., Arnold & McDermott, 2013; Balch, 1998; McDaniel,
Howard, et al., 2009; McDaniel & Little, in press; Nguyen et al., 2016; Nguyen & McDaniel,
2016; Pan, Gopal, et al., 2015; Pan & Rickard, 2017; van Eersel et al., 2016). Uncovering the
precise mechanistic basis for the effectiveness of such training techniques is an important goal for future research.
With regard to the broader transfer literature, the three-factor framework readily
connects to the transfer frameworks of Perkins and Salomon (1994) and Barnett and Ceci (2002).
In relation to the former framework, the response congruency effect can be construed as “low”
circumstances of transfer, whereas elaborated retrieval practice may yield more abstract learning
and thus constitute “high” circumstances of transfer (therefore the three-factor transfer
framework incorporates perspectives from both the identical elements and related similarity-
based models of transfer, as well as the general principle and other abstractionist models). In
relation to the latter framework, both response congruency and elaborated retrieval practice may
enhance learners’ ability to recall and execute prior learning on the transfer test. The three-factor
transfer framework also highlights how the structure of the training and transfer tasks, as well as
encoding and retrieval factors, can moderate the magnitude of transfer (as suggested by Gick &
Holyoak, 1987). Finally, as previously discussed, our findings for the response congruency
moderator provide support for prior theories of response similarity and transfer (Wylie, 1919).
We propose that this review’s findings can be distilled into the following four
educationally relevant principles of transfer. We caution the reader, however, that these
principles are derived from research in predominantly laboratory settings, and that the studies to
date cover only a sample of the variety of transfer contexts of educational interest.
1. Transfer is likeliest when the answers on the initial and final tests are the same, and less
likely when they are not. By definition, transfer tests involve questions that are different
in some way from those that were previously encountered. However, if the correct
answers to those new questions are the same (or nearly so) as the correct answers to
questions that were used during training (e.g., as when an initial test question asks for a term given a definition, and the transfer test asks for the same term given a real-world scenario), then transfer is relatively likely to occur.
2. Transfer increases when initial tests involve retrieving information broadly (broad
encoding). Transfer is more likely when initial tests involve discriminating between answer choices, recalling related information broadly, generating explanations, or answering several questions that address different levels of knowledge (i.e., high and low order questions).
3. Transfer increases when elaborative feedback is provided. Combining practice testing with restudy of all to-be-learned information, or with explanatory or detailed feedback, increases the likelihood of transfer.
4. Transfer increases with higher accuracy on the initial test. The better one performs on
the initial test, the more likely transfer will occur. This candidate principle suggests that training to a high level of initial test accuracy (e.g., via feedback or repeated practice) is advisable when transfer is the goal.
Although these principles provide initial guidance, within the aforementioned limitations, on the effective use of testing to foster transfer, more research is needed into how
best to implement practice testing to induce transfer in authentic educational and other training
contexts.
For further insights into transfer of test-enhanced learning, new empirical research is needed in at least the following areas:
1. New or under-explored transfer contexts – currently, there are only one or a few
published studies regarding category learning and classification (e.g., Baghdady et al.,
2014; Jacoby et al., 2010), visuospatial skills (Carpenter & Kelly, 2012; Rohrer et al.,
2010; cf. Kelly, Carpenter, & Sjolund, 2015), and function learning (Kang et al., 2011).
There are also a number of other prominent transfer contexts that remain almost entirely
unresearched in the test-enhanced learning literature (e.g., analogical transfer and other
types of abstract learning), as well as existing transfer categories that remain relatively under-explored (e.g., mediator and related word cues). Finally, the literature is currently dominated by more "near" than
“far” transfer studies (e.g., very few studies address transfer across knowledge domains).
2. Testing and transfer in authentic educational settings – although some studies have been conducted in classrooms or other applied contexts (e.g., Agarwal, Bain, & Chamberlain, 2012; Balch, 1998; Bjork, Little, & Storm, 2014; McConnell, St-Onge, & Young, 2015; McDaniel, Anderson, et al., 2007; McDaniel, Wildman, et al., 2012), the vast majority of research in this literature has occurred in laboratory settings. An important applied goal is to develop practice testing into a portable and effective real-world learning technique. Thus,
future studies should further investigate testing and transfer in actual classroom and other
learning environments, using methods that approach the level of control afforded by
laboratory studies to the extent practicable. Such research might be more likely to
incorporate more instances of "far" transfer than have been examined in the literature to date.
3. The theoretical and mechanistic basis for transfer of test-enhanced learning – in tandem
with further research on theoretical mechanisms of the testing effect, more research is
needed into the cognitive processes that yield transfer of test-enhanced learning, as well
as the circumstances under which that learning may or may not transfer (i.e., from both theoretical and applied standpoints).
4. Elaborated retrieval practice techniques – the effects of broad encoding methods and elaborative feedback were meta-analyzed for the first time in this review. However,
the number of studies featuring elaborated retrieval practice remains limited, the relative efficacy of different techniques is unclear (e.g., broad retrieval vs. discrimination instructions), and the circumstances under which their benefits emerge remain to be fully established.
5. The ecological validity and role of stimulus materials – in several categories, the choice of stimulus materials appears to influence the degree of transfer (e.g., word pairs vs. facts and word triplets in the stimulus-response rearrangement category; medical diagnoses vs. worked examples in the problem-solving category; also
potentially highly semantically related vs. not semantically related questions in the
untested materials category). Further research into such effects is needed. Also, some
researchers (e.g., Butler, 2010; Chan et al., 2006; Wooldridge et al., 2014) have
suggested that the stimulus materials used in some studies appear to be relatively
“contrived” and have questionable ecological validity; further research is also needed to address that concern.
6. Other moderating factors – due to insufficient data, potentially relevant factors that could
not be investigated in the foregoing meta-analyses include the similarity of cues between
the initial and transfer tests, the role of prior knowledge, the use of hints and reminders at
the transfer test (e.g., to alert learners to the need to transfer knowledge across knowledge
domains, as occurred in Butler, 2010; Experiment 3), and test expectancy manipulations
(e.g., Hinze & Rapp, 2014; Nguyen & McDaniel, 2016). There may also be other, as-yet unidentified moderating factors. In addition, other training techniques that have shown promise at yielding transfer include successive relearning (also called mastery
learning; e.g., Rawson, Dunlosky, & Sciartelli, 2013) and interpolated testing (e.g.,
Wissman, Rawson, & Pyc, 2011). Both involve different training methods than the paradigms examined in this review and warrant further investigation.
Conclusions
Practice testing yields transfer performance that is often substantially better than that in non-testing reexposure control
conditions such as restudy or rereading. Among the major transfer contexts investigated in the
literature to date, testing generally yields positive transfer across test formats and to application
and inference questions, mediator and related word cues, and problems involving medical
diagnoses; it often yields numerically weak transfer, or in some cases possibly negative transfer,
to stimulus-response rearranged items, untested materials seen during initial study, and problems
involving worked examples (although there are a number of prominent exceptions). Publication
bias appears to be moderate in this literature, reducing the magnitude of, but not eliminating, positive transfer in many cases. The likelihood of transfer is moderated by response congruency, elaborated retrieval practice, and the level of initial test performance. Together, these moderating factors
form the basis of a new three-factor transfer framework that appears to accommodate the
majority of results in this literature and provides insights into optimizing transfer in applied
settings.
References
Abbott, E. E. (1909). On the analysis of the factor of recall in the learning process. The
doi:http://dx.doi.org/10.1037/h0093018
Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87(3), 659-701.
*Agarwal, P. K. (2011). Examining the relationship between fact learning and higher order
Agarwal, P. K., Bain, P. M., & Chamberlain, R. W. (2012). The value of applied research:
doi:http://dx.doi.org/10.1007/s10648-012-9210-2
Agarwal, P. K., Finley, J. R., Rose, N. S., & Roediger, H. L. (2016). Benefits from retrieval
practice are greater for students with lower working memory capacity. Memory (advance
Anderson, L., & Krathwohl, D. A. (2001) Taxonomy for learning, teaching and assessing: a
Anderson, R. C., Kulhavy, R. W., & Andre, T. (1972). Conditions under which feedback
186-188. doi:http://dx.doi.org/10.1037/h0032653
Anderson, R., & Biddle, W. (1975). On asking people questions about what they are reading. In
G. H. Bower (Ed.), The Psychology of Learning and Motivation, (pp. 90-132). New
APA Publications and Communications Board Working Group on Journal Article Reporting
doi:http://dx.doi.org/10.1037/0003-066X.63.9.839
*Avci, G. (2011). Transfer of the testing effect: Just how powerful is it? (Order No.
AAI3464205).
Baghdady, M., Carnahan, H., Lam, E. W. N., & Woods, N. N. (2014). Test‐enhanced learning
and its effect on comprehension and diagnostic accuracy. Medical Education, 48(2), 181-
188. doi:http://dx.doi.org/10.1111/medu.12302
*Balch, W. R. (1998). Practice versus review exams and final exam performance. Teaching of
Barnett, S. M., & Ceci, S. J. (2002). When and where do we apply what we learn?: A taxonomy
doi:http://dx.doi.org/10.1037/0033-2909.128.4.612
Benjamin, A. S., & Pashler, H. (2015). The value of standardized testing: a perspective from
cognitive psychology. Policy Insights from the Behavioral and Brain Sciences, 2(1), 13-
23.
*Bies-Hernandez, N. (2014). Examining the testing effect using the dual-process signal detection
Bjork, E. L., Little, J. L., & Storm, B. C. (2014). Multiple-choice testing as a desirable difficulty
in the classroom. Journal of Applied Research in Memory and Cognition, 3(3), 165-170.
doi:http://dx.doi.org/10.1016/j.jarmac.2014.03.002
Bjork, R. A., & Bjork, E. L. (1992). A new theory of disuse and an old theory of stimulus
connectionist theory; vol. 2: From learning processes to cognitive processes. (pp. 35-67).
Bjork, R.A. (1975). Retrieval as a memory modifier: An interpretation of negative recency and
related phenomena. In R.L. Solso (Ed.), Information processing and cognition: The
*Blunt, J. R., & Karpicke, J. D. (2014). Learning with retrieval-based concept mapping. Journal
Borenstein, M., Hedges, L. V., Higgins, J., & Rothstein, H. R. (2010). A basic introduction to
1, 97–111. doi:http://dx.doi.org/10.1002/jrsm.12
Bourne, L. E., Jr., & Healy, A. F. (2014). Train your mind for peak performance: A science-
based approach for achieving your goals Washington, DC: American Psychological
Association. doi:http://dx.doi.org/10.1037/14319-000
Brookhart, S. M. (2015). Making the Most of Multiple Choice. Educational Leadership, 73(1),
36-39.
Transfer of learning: Contemporary research and applications. (pp. 121-150) San Diego:
Academic Press.
Brown, P. C., Roediger, H. L., & McDaniel, M. A. (2014). Make it stick: The science of
Bujang, M. A., & Baharum, N. (2017). Guidelines of the minimum sample size requirements for
*Butler, A. C. (2010). Repeated testing produces superior transfer of learning relative to repeated
1118-1133. doi:http://dx.doi.org/10.1037/a0019902
*Butler, A. C., & Roediger, H. L. (2007). Testing improves long-term retention in a simulated
doi:http://dx.doi.org/10.1080/09541440701326097
Butler, A. C., Godbole, N., & Marsh, E. J. (2013). Explanation feedback is better than correct
Butler, A. C., Karpicke, J. D., & Roediger, H. L.. (2007). The effect of type and timing of
Carey, B. (2013). Frequent tests can enhance college learning, study finds. The New York Times.
college-learning-study-finds.html
*Carpenter, S. K. (2009). Cue strength as a moderator of the testing effect: The benefits of
retention: Support for the mediator effectiveness hypothesis of the testing effect. Journal
doi:http://dx.doi.org/10.1037/a0024140
Carpenter, S. K., & DeLosh, E. L. (2006). Impoverished cue support enhances subsequent
retention: Support for the elaborative retrieval explanation of the testing effect. Memory
Carpenter, S. K., & Kelly, J. W. (2012). Tests enhance retention and transfer of spatial learning.
012-0221-2
Carpenter, S. K., Lund, T. J. S., Coffman, C. R., Armstrong, P. I., Lamm, M. H., & Reason, R.
doi:http://dx.doi.org/10.1007/s10648-015-9311-9
*Carpenter, S. K., & Pashler, H. (2007). Testing beyond words: Using tests to enhance
doi:http://dx.doi.org/10.3758/BF03194092
*Carpenter, S. K., Pashler, H., & Vul, E. (2006). What types of learning are enhanced by a cued
doi:http://dx.doi.org/10.3758/BF03194004
Carpenter, S. K., Pashler, H., Wixted, J. T., & Vul, E. (2008). The effects of tests on learning and
doi:http://dx.doi.org/10.3758/MC.36.2.438
Carrier, M., & Pashler, H. (1992). The influence of retrieval on retention. Memory & Cognition,
20(6), 633-642.
Chan, J. C. K. (2009). When does retrieval induce forgetting and when does it induce
facilitation? implications for retrieval inhibition, testing effect, and text processing.
doi:http://dx.doi.org/10.1016/j.jml.2009.04.004
Chan, J. C. K. (2010). Long-term effects of testing on the recall of nontested materials. Memory,
Initially nontested material can benefit from prior testing of related material. Journal of
doi:http://dx.doi.org/10.1037/0096-3445.135.4.553
*Cho, K. W., Neely, J. H., Brennan, M. K., Vitrano, D., & Crocco, S. (2017). Does testing
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
*Coppens, L. C., Verkoeijen, P. P. J. L., Bouwmeester, S., & Rikers, R. M. J. P. (2016). The
testing effect for mediator final test cues and related final test cues in online and
Cranney, J., Ahn, M., McKinnon, R., Morris, S., & Watts, K. (2009). The testing effect,
doi:http://dx.doi.org/10.1080/09541440802413505
*Darabi, A. A., Nelson, D. W., & Palanki, S. (2007). Acquisition of troubleshooting skills in a
doi:http://dx.doi.org/10.1016/j.chb.2005.11.001
Delaney, P. F., Verkoeijen, P. P. J. L., & Spirgel, A. (2010). Spacing and testing effects: A
deeply critical, lengthy, and at times discursive review of the literature. In B. H. Ross
(Ed.), The psychology of learning and motivation: Advances in research and theory (vol.
7421(10)53003-2
Dempster, F. N. (1996). Distributing and managing the conditions of encoding and practice.
012102570-0/50011-2
Detterman, D. K. (1993). The case for the prosecution: Transfer as an epiphenomenon. Transfer
on trial: Intelligence, cognition, and instruction. (pp. 1-24) Westport: Ablex Publishing.
Druckman, D., & Bjork, R. A. (1994). Learning, remembering, believing: Enhancing human
Duchastel, P. C. (1981). Retention of prose following testing with different types of tests.
Duchastel, P. C., & Nungester, R. J. (1982). Testing effects measured with alternate test forms.
Dudai, Y. (2007). Transfer: its transfer into neurobiology. In H. L. Roediger, Y. Dudai & S. M.
Fitzpatrick (Eds.), Science of memory: Concepts, New York: Oxford University Press.
Dunlap, W. P., Cortina, J. M., Vaslow, J. B., & Burke, M. J. (1996). Meta-analysis of
Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013).
from cognitive and educational psychology. Psychological Science in the Public Interest,
*Eglington, L. G., & Kang, S. H. K. (2016). Retrieval practice benefits deductive inference.
Estes, W. K. (1979). Role of response availability in the effects of cued-recall tests on memory.
doi:http://dx.doi.org/10.1037/0278-7393.5.6.567
Fiorella, L., & Mayer, R. E. (2015). Eight ways to promote generative learning. Educational
Foos, P. W., & Fisher, R. P. (1988). Using tests as learning opportunities. Journal of Educational
Frase, L. T. (1968). Effect of question location, pacing, and mode upon retention of prose
doi:http://dx.doi.org/10.1037/h0025947
Gasparinatou, A., & Grigoriadou, M. (2013). Exploring the effect of background knowledge and
text cohesion on learning from texts in computer science. Educational Psychology, 33(6),
645-670. doi:http://dx.doi.org/10.1080/01443410.2013.790309
George, T., & Wiley, J. (2016, November). Going the distance: The effects of testing on
Boston, MA.
Gick, M. L., & Holyoak, K. J. (1980). Analogical problem solving. Cognitive Psychology, 12(3),
306-355.
Gick, M. L., & Holyoak, K. J. (1987). The cognitive basis of knowledge transfer. Transfer of
learning: Contemporary research and applications. (pp. 9-46). San Diego: Academic
Press.
Glass, G. V., McGaw, B., Smith, M. L. (1981). Meta-analysis in social research. Newbury Park:
Sage Publications.
Glover, J. A. (1989). The "testing" phenomenon: Not gone but nearly forgotten. Journal of
0663.81.3.392
Goode, M. K., Geraci, L., & Roediger, H. L. (2008). Superiority of variable to repeated practice
doi:http://dx.doi.org/10.3758/PBR.15.3.662
*Halamish, V., & Bjork, R. A. (2011). When does testing enhance retention? A distribution-
doi:http://dx.doi.org/10.1037/a0023219
Hamaker, C. (1986). The effects of adjunct questions on prose learning. Review of Educational
Hanawalt, N. G., & Tarr, A. G. (1961). The effect of recall upon recognition. Journal of
Haskell, R. E. (2001). Transfer of learning: Cognition, instruction, and reasoning. San Diego:
Academic Press.
Fitzpatrick (Eds.), Science of memory: Concepts, New York: Oxford University Press.
Healy, A. F., Wohldmann, E. L., & Bourne, L. E., Jr. (2005). The procedural reinstatement
cognitive psychology and its applications; experimental cognitive psychology and its
applications (pp. 59-71, Chapter xxii, 265 Pages) American Psychological Association,
Hedges, L. V., Tipton, E., & Johnson, M. C. (2010). Robust variance estimation in meta-
regression with dependent effect size estimates. Research Synthesis Methods, 1(1), 39–
65. doi:http://dx.doi.org/10.1002/jrsm.5
*Hinze, S. R., & Rapp, D. N. (2014). Retrieval (sometimes) enhances learning: Performance
pressure reduces the benefits of retrieval practice. Applied Cognitive Psychology, 28(4),
597-606. doi:http://dx.doi.org/10.1002/acp.3032
*Hinze, S. R., & Wiley, J. (2011). Testing the limits of testing effects using completion tests.
*Hinze, S. R., Wiley, J., & Pellegrino, J. W. (2013). The importance of constructive
Hogan, R. M., & Kintsch, W. (1971). Differential effects of study and test trials on long-term
recognition and recall. Journal of Verbal Learning & Verbal Behavior, 10(5), 562-567.
*Huff, M. J., Balota, D. A., & Hutchison, K. A. (2016). The costs and benefits of testing and
Huff, M. J., Coane, J. H., Hutchison, K. A., Grasser, E. B., & Blais, J. E. (2012). Interpolated
task effects on direct and mediated false recognition: Effects of initial recall, recognition,
*Jacoby, L. L., Wahlheim, C. N., & Coane, J. H. (2010). Test-enhanced learning of natural
doi:http://dx.doi.org/10.1037/a0020636
Jacoby, L. L., Wahlheim, C. N., & Kelley, C. M. (2015). Memory consequences of looking back
doi:http://dx.doi.org/10.1037/xlm0000123
James, W. (1890). The principles of psychology, vol I. New York: Henry Holt and Co.
doi:http://dx.doi.org/10.1037/10538-000
Jensen, J. L., McDaniel, M. A., Woodard, S. M., & Kummer, T. A. (2014). Teaching to the test
... or testing to teach: Exams requiring higher order thinking skills encourage greater
doi:http://dx.doi.org/10.1007/s10648-013-9248-9
*Johnson, C. I., & Mayer, R. E. (2009). A testing effect with multimedia learning. Journal of
Jones, A. C., Wardlow, L., Pan, S. C., Zepeda, C., Heyman, G. D., Dunlosky, J., and Rickard, T.
C. (2015). Beyond the rainbow: retrieval practice leads to better learning than does
Judd, C. H. (1908). The relation of special training to general intelligence. Educational Review,
36, 28–42.
Kang, S. H. K., McDaniel, M. A., & Pashler, H. (2011). Effects of testing on learning of
doi:http://dx.doi.org/10.3758/s13423-011-0113-x
*Kang, S. H. K., McDermott, K. B., & Roediger, H. L. (2007). Test format and corrective
doi:http://dx.doi.org/10.1080/09541440601056620
Karpicke, J. D., & Aue, W. R. (2015). The testing effect is alive and well with complex
doi:http://dx.doi.org/10.1007/s10648-015-9309-3
*Karpicke, J. D., & Blunt, J. R. (2011). Retrieval practice produces more learning than elaborate
*Karpicke, J. D., Blunt, J. R., Smith, M. A., & Karpicke, S. S. (2014). Retrieval-based learning:
The need for guided retrieval in elementary school children. Journal of Applied Research
Karpicke, J. D., Lehman, M., & Aue, W. R. (2014). Retrieval-based learning: An episodic
context account. The psychology of learning and motivation (vol 61). (pp. 237-284). San
MacMillan (Eds.), Psychological concepts in education (p. 50). Chicago: Rand McNally.
Kelly, J. W., Carpenter, S. K., & Sjolund, L. A. (2015). Retrieval enhances route knowledge
acquisition, but only when movement errors are prevented. Journal of Experimental
doi:http://dx.doi.org/10.1037/a0038685
Kornell, N., Bjork, R. A., & Garcia, M. A. (2011). Why tests appear to prevent forgetting: A
doi:http://dx.doi.org/10.1016/j.jml.2011.04.002
*Kromann, C. B., Bohnstedt, C., Jensen, M. L., & Ringsted, C. (2010). The testing effect on
skills learning might last 6 months. Advances in Health Sciences Education, 15(3), 395-
401. doi:http://dx.doi.org/10.1007/s10459-009-9207-x
*Kromann, C. B., Jensen, M. L., & Ringsted, C. (2011). Test‐enhanced learning may be a
*Kromann, C. B., Jensen, M. L., & Ringsted, C. (2009). The effect of testing on skills learning.
2923.2008.03245.x
Kulhavy, R. W., & Anderson, R. C. (1972). Delay-retention effect with multiple-choice tests.
doi:http://dx.doi.org/10.1037/h0033243
Kulhavy, R. W., & Stock, W. A. (1989). Feedback in written instruction: The place of response
doi:http://dx.doi.org/10.1007/BF01320096
Kuo, T., & Hirshman, E. (1996). Investigations of the testing effect. The American Journal of
Lahey, J. (2014). Students should be tested more, not less. The Atlantic. Available at:
http://www.theatlantic.com/education/archive/2014/01/students-should-be-tested-more-
not-less/283195/
Lakens, D., Hilgard, J., & Staaks, J. (2016). On the reproducibility of meta-analyses: Six
doi:http://dx.doi.org/10.1037/h0076933
*Larsen, D. P., Butler, A. C., & Roediger, H. L. (2013a). Comparative effects of test‐enhanced
doi:http://dx.doi.org/10.1111/medu.12141
*Larsen, D. P., Butler, A. C., Lawson, A. L., & Roediger, H. L. (2013b). The importance of
seeing the patient: Test-enhanced learning with standardized patients and written tests
*Leahy, W., Hanham, J., & Sweller, J. (2015). High element interactivity information during
problem solving may lead to failure to obtain the testing effect. Educational Psychology
*Lechuga, M. T., Ortega-Tudela, J., & Gómez-Ariza, C. J. (2015). Further evidence that concept
mapping is not better than repeated retrieval as a tool for learning from texts. Learning
AAI3493389).
Little, J. L., & Bjork, E. L. (2015). Optimizing multiple-choice tests as tools for learning.
Little, J. L., Bjork, E. L., Bjork, R. A., & Angello, G. (2012). Multiple-choice tests exonerated, at
doi:http://dx.doi.org/10.1177/0956797612443370
Mandler, G., & Rabinowitz, J. C. (1981). Appearance and reality: Does a recognition test really
Marzano, R. J.; Pickering, D.; Pollock, J. E. (2006). Classroom instruction that works: research-
Press.:http://dx.doi.org/10.1017/CBO9780511811678
McConnell, M. M., St-Onge, C., & Young, M. E. (2015). The benefits of testing for learning on
doi:http://dx.doi.org/10.1007/s10459-014-9529-1
& S. M. Fitzpatrick (Eds.), Science of memory: Concepts, New York: Oxford University
Press.
McDaniel, M. A., & Fisher, R. P. (1991). Tests and test feedback as learning sources.
McDaniel, M. A., & Little, J. L. (in press). Multiple-choice and short-answer quizzing on equal
McDaniel, M. A., & Masson, M. E. (1985). Altering memory representations through retrieval.
doi:http://dx.doi.org/10.1037/0278-7393.11.2.371
*McDaniel, M. A., Anderson, J. L., Derbish, M. H., & Morrisette, N. (2007). Testing the testing
doi:http://dx.doi.org/10.1080/09541440701326154
*McDaniel, M. A., Bugg, J. M., Liu, Y., & Brick, J. (2015). When does the test-study-test
*McDaniel, M. A., Howard, D. C., & Einstein, G. O. (2009). The read-recite-review study
doi:http://dx.doi.org/10.1111/j.1467-9280.2009.02325.x
learning from the laboratory to the classroom. Psychonomic Bulletin & Review, 14(2),
200-206. doi:http://dx.doi.org/10.3758/BF03194052
McDaniel, M. A., Thomas, R. C., Agarwal, P. K., McDermott, K. B., & Roediger, H. L. (2013).
*McDaniel, M. A., Wildman, K. M., & Anderson, J. L. (2012). Using quizzes to enhance
doi:http://dx.doi.org/10.1016/j.jarmac.2011.10.001
McDermott, K. B., Agarwal, P. K., D’Antonio, L., Roediger, H. L., & McDaniel, M. A. (2014).
middle and high school classes. Journal of Experimental Psychology: Applied, 20(1), 3-
21. doi:http://dx.doi.org/10.1037/xap0000004
McNamara, D. S., & Kintsch, W. (1996). Learning from texts: Effects of prior knowledge and
doi:http://dx.doi.org/10.1080/01638539609544975
*Meyer, A. N. D., & Logan, J. M. (2013). Taking the testing effect beyond the college freshman:
doi:http://dx.doi.org/10.1037/a0030890
Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., & The PRISMA Group (2009). Preferred
Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement.
Morris, C. D., Bransford, J. D., & Franks, J. J. (1977). Levels of processing versus transfer
appropriate processing. Journal of Verbal Learning & Verbal Behavior, 16(5), 519-533.
Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with
125. doi:http://dx.doi.org/10.1037/1082-989X.7.1.105
Mozer, M. C., Howe, M., & Pashler, H. (2004). Using testing to enhance learning: A comparison
Mahwah: Erlbaum.
*Mulligan, N. W., & Peterson, D. J. (2015). Negative and positive testing effects in terms of
*Nguyen, K., & McDaniel, M. A. (2016). The JOIs of text comprehension: Supplementing
*Nguyen, K., Gouravajhala, R., & McDaniel, M. A. (2016). Can testing enhance transfer of
*Nungester, R. J., & Duchastel, P. C. (1982). Testing versus review: Effects on retention.
0663.74.1.18
*Pan, S. C., Gopal, A., & Rickard, T. C. (2015). Testing with feedback yields potent, but
*Pan, S. C., Hutter, S., D’Andrea, D., Unwalla, D., & Rickard, T. C. Investigations of learning
manuscript.
Pan, S. C., Pashler, H., Potter, Z. E., & Rickard, T. C. (2015). Testing enhances learning across a
range of episodic memory abilities. Journal of Memory and Language 83, 53-61.
doi:http://dx.doi.org/10.1016/j.jml.2015.04.001
Pan, S. C., & Rickard, T. C. (2015). Sleep and motor memory: is there room for consolidation?
*Pan, S. C., & Rickard, T. C. (2017). Does retrieval practice enhance learning and transfer for
doi:http://dx.doi.org/10.1037/xap0000124
Pan, S. C., Rubin, B. R., & Rickard, T. C. (2015). Does testing with feedback improve adult
*Pan, S. C., Wong, C. M., Potter, Z. E., Mejia, J., & Rickard, T. C. (2015). Does test-enhanced
learning transfer for triple associates? Memory & Cognition 44(1). doi:http://dx.doi.org/
10.3758/s13421-015-0547-x
Pashler, H., Bain, P., Bottge, B., Graesser, A., Koedinger, K., McDaniel, M., & Metcalfe, J.
(2007). Organizing instruction and study to improve student learning (NCER 2007–
2004). Washington, DC: National Center for Education Research, Institute of Education
Pashler, H., Cepeda, N. J., Wixted, J. T., & Rohrer, D. (2005). When does feedback facilitate
Pashler, H., Rohrer, D., Cepeda, N. J., & Carpenter, S. K. (2007). Enhancing learning and
retarding forgetting: Choices and consequences. Psychonomic Bulletin & Review, 14(2),
187-193. doi:http://dx.doi.org/10.3758/BF03194050
Paul, A. M. (2015). Researchers find that frequent tests can boost learning. Scientific American.
tests-can-boost-learning/
Perkins, D. N., & Salomon, G. (1994). Transfer of learning. In T. Husen & T. N. Postelwhite
(Eds.). International Handbook of Educational Research (Second Edition, Vol. 11; pp.
*Peterson, D. J., & Mulligan, N. W. (2013). The negative testing effect and multifactor account.
1293. doi:http://dx.doi.org/10.1037/a0031337
*Pilotti, M., Chodorow, M., & Petrov, R. (2009). The usefulness of retrieval practice and review-
only practice for answering conceptually related test questions. Journal of General
Popham, W. J. (2011). Classroom assessment: What teachers need to know (6th edn.). Boston,:
Pyc, M. A., & Rawson, K. A. (2010). Why testing improves memory: Mediator effectiveness
Hedges, & J. C. Valentine (Eds.), Handbook of research synthesis and meta-analysis (2nd
Rawson, K. A., & Dunlosky, J. (2011). Optimizing schedules of retrieval practice for durable
Rawson, K. A., & Dunlosky, J. (2012). When is practice testing most effective for improving the
419-435. doi:http://dx.doi.org/10.1007/s10648-012-9203-1
Rawson, K. A., Dunlosky, J., & Sciartelli, S. M. (2013). The power of successive relearning:
doi:http://dx.doi.org/10.1007/s10648-013-9240-4
*Rawson, K. A., Vaughn, K. E., & Carpenter, S. K. (2015). Does the benefit of testing depend
on lag, and if so, why? evaluating the elaborative retrieval hypothesis. Memory &
Rickard, T. C., & Bourne, L. E., Jr. (1996). Some tests of an identical elements model of basic
Rickard, T. C., Healy, A. F., & Bourne, L. E. (1994). On the cognitive structure of basic
arithmetic skills: Operation, order, and symbol transfer effects. Journal of Experimental
doi:http://dx.doi.org/10.1037/0278-7393.20.5.1139
Rickard, T. C., & Pan, S. C. (2017). A dual memory theory of the retrieval practice effect.
*Rickard, T. C., & Pan, S. C. Test-enhanced learning of pairs, triplets, and facts: when and why
Fitzpatrick (Eds.), Science of memory: Concepts, New York: Oxford University Press.
Roediger, H. L., & Blaxton, T. A. (1987). Retrieval modes produce dissociations in memory for
Roediger, H. L., & Butler, A. C. (2011). The critical role of retrieval practice in long-term
doi:http://dx.doi.org/10.1016/j.tics.2010.09.003
Roediger, H. L., & Karpicke, J. D. (2006). The power of testing memory: Basic research and
210. doi:http://dx.doi.org/10.1111/j.1745-6916.2006.00012.x
Roediger, H. L., Putnam, A. L., & Smith, M. A. (2011). Ten benefits of testing and their
applications to educational practice. The psychology of learning and motivation (vol 55):
doi:http://dx.doi.org/10.1016/B978-0-12-387691-1.00001-6
Roediger, H. L., & Pyc, M. A. (2012). Inexpensive techniques to improve education: Applying
doi:http://dx.doi.org/10.1016/j.jarmac.2012.09.002
Roediger, H. L., & Marsh, E. J. (2005). The positive and negative consequences of multiple-
*Rohrer, D., Taylor, K., & Sholar, B. (2010). Tests enhance the transfer of learning. Journal of
doi:http://dx.doi.org/10.1037/a0017678
Rowland, C. A. (2014). The effect of testing versus restudy on retention: A meta-analytic review
doi:http://dx.doi.org/10.1037/a0037559
*Rowland, C. A., & DeLosh, E. L. (2015). Mnemonic benefits of retrieval practice at short
doi:http://dx.doi.org/10.1080/09658211.2014.889710
*Rowland, C. A., Littrell-Baez, M., Sensenig, A. E., & DeLosh, E. L. (2014). Testing effects in
doi:http://dx.doi.org/10.3758/s13421-014-0404-3
11(6), 641-650.
Salomon, G., & Perkins, D. N. (1989). Rocky roads to transfer: Rethinking mechanisms of a
Schmidt, R. A., & Bjork, R. A. (1992). New conceptualizations of practice: Common principles
in three paradigms suggest new concepts for training. Psychological Science, 3(4), 207-
217. doi:http://dx.doi.org/10.1111/j.1467-9280.1992.tb00029.x
*Sensenig, A. E. (2011). Multiple choice testing and the retrieval hypothesis of the testing effect
*Sensenig, A. E., Littrell-Baez, M., & DeLosh, E. L. (2011). Testing effects for common versus
doi:http://dx.doi.org/10.1080/09658211.2011.599935
Singley, M. K., & Anderson, J. R. (1989). The transfer of cognitive skill. Cambridge: Harvard
University Press.
Smith, M. A., & Karpicke, J. D. (2014). Retrieval practice with short-answer, multiple-choice,
doi:http://dx.doi.org/10.1080/09658211.2013.831454
Soderstrom, N. C., & Bjork, R. A. (2014). Testing facilitates the regulation of subsequent study
doi:http://dx.doi.org/10.1016/j.jml.2014.03.003
Stanislaw, H., & Todorov, N. (1999). Calculation of signal detection theory measures. Behavior
doi:http://dx.doi.org/10.3758/BF03207704
doi:http://dx.doi.org/10.1002/jrsm.1095
Sternberg, R. (1993). Transfer on trial: Intelligence, cognition, and instruction. Westport: Ablex
Publishing.
Strube, M. J., & Hartmann, D. P. (1983). Meta-analysis: Techniques, applications, and functions.
retentio/
Tanner-Smith, E. E., & Tipton, E. (2014). Robust variance estimation with dependent effect
sizes: Practical considerations including a software tutorial in Stata and SPSS. Research
*Tran, R., Rohrer, D., & Pashler, H. (2015). Retrieval practice: The lack of transfer to deductive
doi:http://dx.doi.org/10.3758/s13423-014-0646-x
Tullis, J. G., Benjamin, A. S., & Ross, B. H. (2014). The reminding effect: Presentation of
Tulving, E. (1984). Précis of elements of episodic memory. Behavioral and Brain Sciences, 7(2),
223-268.
Van den Broek, G., Takashima, A., Wiklund-Hörnqvist, C., Wirebring, L. K., Segers, E.,
Verhoeven, L., & Nyberg, L. (2016). Neurocognitive mechanisms of the “testing effect”:
Van den Bussche, E., Van den Noortgate, W., & Reynvoet, B. (2009). Mechanisms of masked
doi:http://dx.doi.org/10.1037/a0015329
*van Eersel, G. G., Verkoeijen, P. P., Povilenaite, M., & Rikers, R. (2016). The testing effect and
*van Gog, T., & Kester, L. (2012). A test of the testing effect: Acquiring problem-solving skills
doi:http://dx.doi.org/10.1111/cogs.12002
van Gog, T., & Paas, F. (2008). Instructional efficiency: Revisiting the original construct in
doi:http://dx.doi.org/10.1080/00461520701756248
van Gog, T., & Sweller, J. (2015). Not new, but nearly forgotten: The testing effect decreases or
doi:http://dx.doi.org/10.1016/j.cedpsych.2010.10.004
*van Gog, T., Kester, L., & Paas, F. (2011). Effects of worked examples, example-problem, and
*van Gog, T., Kester, L., Dirkx, K., Hoogerheide, V., Boerboom, J., & Verkoeijen, P. P. J. L.
(2015). Testing after worked example study does not enhance delayed problem-solving
doi:http://dx.doi.org/10.1007/s10648-015-9297-3
Vaughn, K. E., & Rawson, K. A. (2014). Effects of criterion level on associative memory:
Evidence for associative asymmetry. Journal of Memory and Language, 75, 14-26.
doi:http://dx.doi.org/10.1016/j.jml.2014.04.004
*Veltre, M. T., Cho, K. W., & Neely, J. H. (2015). Transfer-appropriate processing in the testing
*Verkoeijen, P. P. J. L., Bouwmeester, S., & Camp, G. (2012). A short-term testing effect in
doi:http://dx.doi.org/10.1177/0956797611435132
Vevea, J. L., & Hedges, L. V. (1995). A general linear model for estimating effect size in the
Vevea, J. L., & Woods, C. M. (2005). Publication bias in research synthesis: sensitivity analysis
Wenger, S. K., Thompson, C. P., & Bartling, C. A. (1980). Recall facilitates subsequent
135-144. doi:http://dx.doi.org/10.1037/0278-7393.6.2.135
Wissman, K. T., Rawson, K. A., & Pyc, M. A. (2011). The interim test effect: Testing prior
material can facilitate the learning of new material. Psychonomic Bulletin & Review,
*Wooldridge, C. L., Bugg, J. M., McDaniel, M. A., & Liu, Y. (2014). The testing effect with
doi:http://dx.doi.org/10.1016/j.jarmac.2014.07.001
Wuensch, K. L. (2012). Using SPSS to obtain a confidence interval for Cohen’s d. Available at:
http://core.ecu.edu/psyc/wuenschk/SPSS/CI-d-SPSS.pdf.
Wylie, H. H. (1919). An experimental study of transfer of response in the white rat. Behavior
Monographs, 3.
Yeo, D., & Fazio, L. (in press). The optimal learning strategy depends on learning goals and
*Zhou, A., Ma, X., Li, J., & Cui, D. (2013). The advantage effect of retrieval practice on
memory retention and transfer: Based on explanation of cognitive load theory. Acta
Figure 1. Flowchart of the literature search and selection process (n refers to individual studies).
a. [Forest plot image. Subcategory labels: Free recall to recognition; Cued recall to free recall; Cued recall to recognition; Cued recall to multiple-choice; Multiple-choice to cued recall.]
Figure 2, panel a. Forest plot of effect sizes (Cohen’s d) with 95% confidence intervals for the transfer
across test formats category. Study order matches Table 1. (*) denotes exclusions due to non-
independent reexposure controls; (†) denotes strong response congruency and/or elaborated retrieval
practice.
b. [Forest plot image. Subcategory labels: Stimulus-response rearrangement (Paired associates; Triple associates; Multi-term facts; Term-definition facts); Untested materials seen during initial study.]
Figure 2, panel b. Forest plot of effect sizes (Cohen’s d) with 95% confidence intervals for the transfer to
stimulus-response rearrangement and to untested materials categories. Study order matches Table 1. (*)
denotes exclusions due to non-independent reexposure controls; (†) denotes strong response congruency
and/or elaborated retrieval practice.
c. [Forest plot image. Subcategory labels: Application questions only; Inference questions only; Application and inference questions.]
Figure 2, panel c. Forest plot of effect sizes (Cohen’s d) with 95% confidence intervals for the transfer to
application and inference questions category. Study order corresponds to Table 1. (*) indicates
exclusions due to non-independent reexposure controls; (†) indicates strong response congruency and/or
elaborated retrieval practice.
d. [Forest plot image. Subcategory labels: Problem-solving skills (Medical diagnosis and treatment; Worked examples); Related word cues.]
Figure 2, panel d. Forest plot of effect sizes (Cohen’s d) with 95% confidence intervals for the transfer of
problem-solving skills and mediator and related word cues categories. Study order corresponds to Table
1. (*) indicates exclusions due to non-independent reexposure controls; (†) indicates strong response
congruency and/or elaborated retrieval practice.
Figure 3. Scatterplot of testing vs. transfer effect sizes (Cohen’s d) from 28 studies that
assessed both effects within the same experiments (81 such cases; all categories except
problem-solving skills are represented). The dotted line represents the hypothetical case of
equal testing and transfer effect magnitudes (for points above the line, the transfer effect is
larger; for points below it, the testing effect is larger). The solid diagonal lines represent the
best least-squares regression fits to the data for categories with strong response congruency (i.e.,
test format and mediator and related cues) and no response congruency (all other categories).
Figure 4. A three-factor framework for transfer of test-enhanced learning. Lines represent predicted transfer effect sizes as a function of
three factors: initial test performance and the presence or absence of response congruency and elaborated retrieval practice. Panel a: random-
effects analysis estimates. Panel b: PEESE analysis estimates. Effect size estimates are drawn across the full range of proportion correct in the
dataset (which was from 0.19 to 0.98).
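The prediction lines in Figure 4 can be reproduced directly from the simultaneous-moderator coefficients reported in Table 3 (random-effects) and Table 4 (PEESE). The minimal Python sketch below illustrates that computation; the function and variable names are ours, not those of the original analysis scripts.

```python
# Sketch: predicted transfer effect sizes (Cohen's d) under the three-factor
# framework, using the simultaneous-moderator coefficients from Tables 3 and 4.
# Function and variable names are illustrative, not from the original analyses.

def predicted_d(initial_test_performance, response_congruency, elaborated_rp, coefs):
    """Linear prediction: intercept + b_itp * p + b_rc * rc + b_erp * erp."""
    return (coefs["intercept"]
            + coefs["initial_test_performance"] * initial_test_performance
            + coefs["response_congruency"] * float(response_congruency)
            + coefs["elaborated_rp"] * float(elaborated_rp))

# Table 3, simultaneous fit with initial test performance (k = 135)
random_effects_coefs = {"intercept": -0.16, "initial_test_performance": 0.58,
                        "response_congruency": 0.26, "elaborated_rp": 0.23}
# Table 4, PEESE fit with moderators (k = 135), sampling variability set to zero
peese_coefs = {"intercept": -0.30, "initial_test_performance": 0.50,
               "response_congruency": 0.25, "elaborated_rp": 0.14}

for p in (0.19, 0.50, 0.98):  # observed range of initial test performance
    d_re = predicted_d(p, True, True, random_effects_coefs)
    d_peese = predicted_d(p, True, True, peese_coefs)
    print(f"p = {p:.2f}: random-effects d = {d_re:.2f}, PEESE d = {d_peese:.2f}")
```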
a.-d. [Schematic panels depicting stimulus, mediator, and response nodes (cues and targets) linked by arrows; see the caption below for panel descriptions.]
Figure 5. Four scenarios of positive transfer via response congruency and stimulus-to-response pathway reinstatement. Solid arrows
represent proposed pathways. Panel a: the transfer stimulus, stimulus′, is a minimally modified version of the stimulus on the initial test (as
occurs in many cases of transfer across test formats, e.g., multiple-choice to cued recall as in Pan, Gopal, et al., 2015). Panel b: partial
pathway reinstatement for mediator cues on the transfer test, wherein an associative pathway that was directly formed between the mediator
and the target on the initial test is reactivated (Coppens et al., 2016). Panel c: full stimulus-to-response pathway reinstatement for mediator
cues on the transfer test wherein the mediator prompts the original cue and leads to recall of the target (Cho et al., 2017). Panel d: the stimuli
on the transfer test are a subset of, or overlap with, the stimuli on the initial test (as may occur for some cases of problem-solving involving
medical diagnosis and treatment, e.g., Larsen et al., 2016a).
Figure 6. Scatterplot depicting the relationship between transfer effect size (Cohen’s d) and
initial test performance (proportion correct), where reported (k = 135 effect sizes), across all six
major transfer categories. The diagonal line represents the best least squares regression fit to the
data.
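The line in Figure 6 is a least-squares fit of transfer effect size on initial test performance. A minimal sketch follows; the two arrays are hypothetical placeholders standing in for the k = 135 effect sizes listed in Table 1, and the unweighted fit shown here may differ from the authors' exact procedure.

```python
# Sketch: least-squares fit of transfer effect size (Cohen's d) on initial test
# performance, as plotted in Figure 6. The arrays are hypothetical placeholders.
import numpy as np

initial_test_performance = np.array([0.25, 0.40, 0.55, 0.70, 0.85])  # hypothetical
transfer_d = np.array([-0.10, 0.15, 0.30, 0.55, 0.70])               # hypothetical

slope, intercept = np.polyfit(initial_test_performance, transfer_d, deg=1)
print(f"fitted line: d = {intercept:.2f} + {slope:.2f} * proportion correct")
```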
Figure 7. All 192 transfer effect sizes (Cohen’s d) in the overall dataset plotted as a function of
sampling variability (sv). PEESE analysis estimates for the different levels and combinations of
response congruency and elaborated retrieval practice are also plotted. The intercepts of the
plotted moderator lines depict the estimated effect size for each moderator when sv is at a
hypothetical value of zero, and hence in principle a state of no publication bias. Inspection of the
scatterplot suggests publication bias; as sv increases, the upper half of the plot has more effect
sizes than the lower half.
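For readers unfamiliar with PEESE, the lines in Figure 7 come from a weighted regression of effect size on sampling variability; the intercept estimates the mean effect at sv = 0, that is, under a hypothetical absence of small-study bias. A minimal sketch of that regression follows, with hypothetical data and without the moderator terms used in the published analyses.

```python
# Sketch: PEESE-style weighted least-squares regression of effect size on
# sampling variability (sv). The intercept is the bias-adjusted estimate at
# sv = 0. Data arrays are hypothetical; the real values appear in Table 1.
import numpy as np

d  = np.array([0.10, 0.35, 0.60, 0.85, 1.10])   # hypothetical effect sizes
sv = np.array([0.01, 0.03, 0.06, 0.09, 0.13])   # hypothetical sampling variances

X = np.column_stack([np.ones_like(sv), sv])     # intercept and sv predictor
W = np.diag(1.0 / sv)                           # inverse-variance weights
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ d)
print(f"PEESE intercept (bias-adjusted d) = {beta[0]:.2f}; sv slope = {beta[1]:.2f}")
```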
Table 1.
Studies of Transfer of Test-Enhanced Learning
Study design
Category Sub-category Reference Exp. Stimuli Feedback Delay (hr) Train Final Initial test condition Reexposure control Transfer test NT NR d sv
Test format
Free recall to
cued recall
Karpicke & Blunt, 2011 1 Passage Y (R) 168 FR CR RP Rpt. study Verbatim 20 20 1.15 0.13
Zhou et al., 2013 1 Passage Y (R) 168 FR CR RP Restudy Factual 40 40 0.38 0.052
2 Passage Y (R) 168 FR CR RP Concept mapping Factual 43 0.44 0.027
Blunt & Karpicke, 2014 2 Passage Y (R) 168 FR CR RP, paragraph Rpt. study, paragraph Verbatim 20 20 0.82 0.12
Free recall to
recognition
Verkoeijen et al., 2012 Words N 0.033 FR Recog Testing Restudy Within language 33 -0.15 0.033
Huff et al., 2016 1 Words N 0 FR Recog Recall, list Restudy Corrected recog. 39 40 0.66 0.055
2 Words N 33 FR Recog Recall, list Restudy Corrected recog. 18 19 1.09 0.13
3 Words N 33 FR Recog Recall, list Restudy Corrected recog. 30 -0.18 0.036
Cued recall to
free recall
Carpenter et al., 2006 1 PAL Y 33 CR FR Test/study Study Recall bs 35 0.64 0.037
2 PAL Y 33 CR FR Test/study Study Recall bs 26 0.42 0.046
Carpenter & Pashler, 2007 Maps Y 0.5 CR FR Test/study Study LQ, abs. acc. 50 0.48 0.023
Carpenter, 2009 1 PAL N 0.083 CR FR Test Study 60 1.41 0.032
2 PAL N 0.083 CR FR Test Study Final test 76 76 1.19 0.031
Rohrer et al., 2010 1 Maps Y 24 CR FR TS Study-only Transfer test 28 0.77 0.051
Sensenig et al., 2011 1 Words N 0.083 CR FR Tested, occupation/name Restudied Final test 98 0.51 0.012
3 Words N 0.083 CR FR Tested, name/noun Restudied Final test 103 0.66 0.012
Halamish & Bjork, 2011 1 PAL N 0.14 CR FR STT SSS Difficult test 24 1.44 0.10
2 PAL N 0.14 CR FR STT SSS Difficult test 20 0.66 0.070
Peterson & Mulligan, 2013 3 PAL Y 0 CR FR Retrieval Restudy Final test 18 18 0.78 0.13
Karpicke et al., 2014 3 Texts N 0 CR FR Guided retrieval Reread Final test 85 0.42 0.013
Rowland et al., 2014 1 Words N 0.083 CR FR Test, pure+mixed Restudy, pure+mixed Final test 64 0.47 0.018
2 Words N 0.083 CR FR Test, pure+mixed Restudy, pure+mixed Final test 43 0.93 0.036
3 Words N 0.067 CR FR Test, pure list Restudy, pure list Final test 23 23 0.51 0.094
Rowland & DeLosh, 2015 1 Words N 0.075 CR FR Tested Restudy Final test 36 -0.27 0.031
2 Words N 0.075 CR FR Tested Restudy Final test 71 0.19 0.015
3 Words N 0.038 CR FR Tested Restudy Final test 63 0.72 0.020
4 Words Y 0.017 CR FR Tested Restudy Final test 38 0.92 0.039
Mulligan & Peterson, 2015 4 PAL Y 0.25 CR FR Retrieval, pure list Restudy Final test 28 28 0.42 0.076
4 PAL Y 0.25 CR FR Retrieval, mixed list Restudy Final test 28 1.29 0.074
Cued recall to
recognition
Jacoby et al., 2010 1 CAT Y 0 CR Recog. STsTsTs SSSS Hit rate 40 0.75 0.034
2 CAT Y 0 CR Recog. STsTsTsTsTs SSSSSS Hit rate 36 0.58 0.035
3 CAT N 12 CR Recog SSSTTT‡ SSSSSS Hit rate 72 0.48 0.016
3 CAT N 12 CR Recog STTTTT‡ SSSSSS Hit rate 72 0.68 0.018
Carpenter, 2011 1 PAL N 0.083 CR Recog Test Study Hit rate targets 30 29 0.51 0.073
Bies-Hernandez, 2013 1 Words Y 48 CR Recog ST Restudying Final test 25 25 0.79 0.090
1 Words Y 48 CR Recog STTT Restudying Final test 25 25 1.13 0.098
2 Words Y 48 CR Recog ST Restudying Standard 25 25 0.75 0.090
2 Words Y 48 CR Recog ST Restudying Recollection 25 25 1.03 0.095
Cued recall to
multiple-choice
Nungester & Duchastel, 1982 Passages N 336 CR MC Test Review Old items, MC 31 34 0.81 0.069
Kang et al., 2007 1 Passages N 72 CR MC SA initial test Read statements Final MC 48 -0.30 0.023
2 Passages Y 72 CR MC SA initial test Read statements Final MC 48 0.47 0.024
Avci, 2011 4 Passages N 168 CR MC SA condition Reread Final MC 41 0.53 0.029
Hinze & Wiley, 2011 3 Passages N 48 CR MC Paragraph recall Reread Final test 25 0.44 0.048
3 Passages Y (R) 48 CR MC Paragraph recall Reread Final test 25 0.35 0.047
McDaniel et al., 2012 1 Facts Y (E) 504 CR MC Short answer Read Identical 16 1.05 0.12
2 Facts Y (E) 264 CR MC Short answer Read Identical 27 0.82 0.055
Wooldridge et al., 2014 1 Passages Y 48 CR MC Repeated fact Highlight Fact qs. 25 29 1.52 0.10
Multiple-choice
to cued recall
Nungester & Duchastel, 1982 Passages N 336 MC CR Test Review Old items, SA 31 34 0.92 0.071
Butler & Roediger, 2007 Video Mixed** 672 MC CR Multiple choice Study Final test 27 0.01 0.040
Kang et al., 2007 1 Passages N 72 MC CR MC initial test Read statements Final SA 48 0.28 0.023
2 Passages Y 72 MC CR MC initial test Read statements Final SA 48 0.17 0.022
Sensenig, 2010 1 Passages N 0.083 MC CR Multiple choice Re-studied Final test 74 -0.44 0.015
3 Passages N 0.083 MC CR Multiple choice Re-studied Final test 54 0.18 0.020
Avci, 2011 4 Passages N 168 MC CR MC condition Reread Final SA 41 0.60 0.031
Meyer & Logan, 2013 Facts Y** 24 MC CR Tested‡ Restudied Final test 180 1.03 0.0084
Pan, Gopal, et al., 2015 3 Facts Y 48 MC CR Tested Restudied Final test 52 0.52 0.023
Stimulus-response
rearrangement
Paired associates
Carpenter et al., 2006 1 PAL Y 33 CR CR Test/study Study ?→B 53 0.86 0.026
1 PAL Y 33 CR FR Test/study Study Recall As 45 0.60 0.028
2 PAL Y 33 CR CR Test/study Study ?→B 19 0.48 0.067
2 PAL Y 33 CR FR Test/study Study Recall As 18 1.01 0.10
Cheng, 2014 1 PAL N 0.25 MC CR Test Restudy Backward 50 0.72 0.027
Rickard & Pan, 2018 1 PAL Y 24 CR CR Tested Restudied Tested-rearranged 33 0.89 0.046
1 PAL Y 168 CR CR Tested Restudied Tested-rearranged 25 0.72 0.057
App. PAL Y 24 CR CR Tested Restudied Tested-rearranged 33 0.58 0.038
Triple associates
Pan, Wong, et al., 2016 1 Triplets Y 168 CR CR Tested Restudied Tested-inverted 42 0.23 0.026
2 Triplets Y 168 CR CR Tested Restudied Tested-inverted 58 0.16 0.018
Rickard & Pan, 2018 1 Triplets Y 24 CR CR Tested Restudied Tested-rearranged 32 0.06 0.033
1 Triplets Y 168 CR CR Tested Restudied Tested-rearranged 29 0.08 0.037
App. Triplets Y 24 CR CR Tested Restudied Tested-rearranged 35 0.08 0.030
Multi-term facts
McDaniel et al., 2007 Chapters Y(D) 504 CR MC Quizzed Read only Unit exam 34 0.41 0.038
Hinze & Wiley, 2011 1 Passages N 48 CR CR FITB testing Restatement Related 69 0.14 0.015
2 Passages N 168 CR CR FITB testing Restatement Related 26 -0.07 0.042
Pan, Gopal, et al., 2015 1 Facts Y 48 CR CR Tested Restudied Transfer 38 -0.01 0.028
2 Facts Y 48 CR CR Tested, 2x Restudied, 2x Transfer, 2x 58 -0.09 0.018
3 Facts Y 48 MC CR Tested Restudied Transfer 52 -0.30 0.021
4 Facts Y 24 CR CR Tested, 1x Restudied, 1x Transfer, 1x 45 -0.22 0.024
4 Facts Y 24 CR CR Tested, 2x Restudied, 2x Transfer, 2x 45 -0.01 0.023
Pan, Hutter, et al., 2018 1 Facts Y 48 CR CR Tested Restudied Transfer 51 0.03 0.020
2 Facts Y 48 CR CR Tested Restudied Transfer 57 -0.17 0.018
3 Facts Y 48 CR CR Tested Restudied Transfer 56 0.19 0.019
4 Facts Y 48 CR CR Tested Restudied Transfer 57 0.16 0.018
Term-definition facts
McDaniel et al., 2015 1 Passage N 120 MC CR TTT* (BE) SSS* Diff. stem, definition 26 25 0.87 0.090
1 Passage Y (R) 120 MC CR TST (BE) SSS Diff. stem, definition 24 25 0.39 0.087
2 Passage Y 120 MC CR TTT* (BE) SSS* Definition 36 35 0.10 0.058
2 Passage Y (R) 120 MC CR TST (BE) SSS Definition 36 35 0.59 0.061
Pan & Rickard, 2017 2 Facts Y 48 MC MC Tested Restudied Tested-different 31 0.21 0.035
2 Facts Y 48 MC MC Tested Restudied Tested-different 28 0.14 0.039
3a Facts Y 48 MC CR Tested Restudied Tested-different 49 0.09 0.021
3b Facts Y 48 MC CR Tested Restudied Tested-different 52 -0.02 0.020
4 Facts Y 48 MC CR Tested Restudied Tested-different 54 0.15 0.019
4 Facts Y 48 MC CR Tested Restudied Tested-different 59 0.09 0.018
Untested materials seen
during initial study
None specified
La Porte & Voss, 1975 Passages N 168 CR CR Question 50 Statement 50 New qs. 24 24 -0.18 0.088
Passages N 168 CR CR Question 100* Statement 100* New qs. 24 24 0.36 0.089
Passages Y 168 CR CR Question KR 50* Statement 50* New qs. 24 24 0.22 0.088
Passages Y 168 CR CR Question KR 100 Statement 100 New qs. 24 24 -0.09 0.087
Nungester & Duchastel, 1982 Passage N 336 MC/CR MC/CR Test Review Subset B 31 34 -0.25 0.064
Balch, 1998 1 Course Y (R) 168 MC MC Practice-exam Review Final test 66 66 0.37 0.031
Chan et al., 2006 1 Passage N 24 CR CR Testing Extra study Not presented, Day 1 28 28 0.70 0.079
3 Passages N 24 CR CR Narrow retrieval* Extra study* Related 18 18 0.46 0.14
3 Passages N 24 CR CR Broad retrieval (BE) Extra study Related 18 18 1.01 0.13
Pilotti et al., 2009 Course Y 0 MC MC RP Reviewing Final test 127 129 -0.13 0.016
Butler, 2010 1a/1b/2 Passages Y 168 CR CR Same/variable test‡ Re-study passages Control 72 -0.44 0.016
Agarwal, 2011 1 Passages Y 48 MC MC Higher order quiz Study twice Concept qs. 48 -0.03 0.022
Little, 2011 5 Passages Y (R) 0.083 MC CR Standard Extended-study Related 28 -0.13 0.039
5 Passages Y (R) 0.083 MC CR Discrimination (BE) Extended-study Related 28 0.41 0.042
McDaniel et al., 2012 1 Chapters Y (E) 504 MC MC Multiple-choice* Read* Related 16 0.48 0.082
1 Chapters Y (E) 504 CR MC Short answer Read Related 16 0.49 0.083
Hinze et al., 2013 2 Passages N 192 CR MC Paragraph recall Reread Detail test 25 26 0.82 0.089
2 Passages N 192 CR MC Expect detail* Reread* Detail test 23 26 0.06 0.086
2 Passages N 192 CR MC Expect inference* Reread* Detail test 23 26 0.58 0.090
3 Passages N 192 FR MC Free recall* Reread* Detail test 31 31 0.16 0.067
3 Passages N 192 FR MC Explain (BE) Reread Detail test 31 31 0.65 0.071
Hinze & Rapp, 2014 1 Passages N 168 CR MC Low-stakes quiz Rereading Detail 21 21 -0.22 0.10
1 Passages N 168 CR MC High-stakes quiz* Rereading* Detail 19 21 -0.84 0.12
Wooldridge et al., 2014 1 Chapter Y 48 CR MC Related fact quiz Highlight Fact qs. 20 29 0.31 0.088
2 Chapter Y (R) 48 CR MC Quiz-restudy Highlight Fact qs. 24 24 0.14 0.087
Application and
inference
Application
questions only
Hinze & Rapp, 2014 1 Passages N 168 CR CR Low-stakes quiz Rereading Application 21 21 0.65 0.11
1 Passages N 168 CR CR High-stakes quiz* Rereading* Application 19 21 -0.08 0.11
Wooldridge et al., 2014 1 Chapter Y 48 CR MC Related application Highlight Application 30 24 -0.33 0.079
2 Chapter Y (R) 48 CR MC Quiz-restudy Highlight Application 24 24 0.12 0.087
McDaniel et al., 2015 1 Passage N 120 MC CR TTT*(BE) SSS* Diff. stem, application 26 25 0.48 0.084
1 Passage Y (R) 120 MC CR TST (BE) SSS Diff. stem, application 24 25 0.59 0.089
2 Passage Y 120 MC CR TTT*(BE) SSS* Application 36 35 0.29 0.059
2 Passage Y (R) 120 MC CR TST (BE) SSS Application 36 35 0.68 0.062
Nguyen & McDaniel, 2016 1 Passages Y (R) 0.37 FR CR Standard 3R Meta-notetaking Problem-solving 32 32 0.14 0.065
1 Passages Y (R) 0.37 FR CR Meta-3R* Meta-notetaking* Prob.-solving 32 32 0.14 0.065
Nguyen et al., 2016 2 Passages Y 48 MC MC Novel quiz Novel presentation Transfer 22 23 -0.02 0.093
2 Passages Y 48 MC MC Text-verbatim quiz* Novel presentation* Final test 22 23 -0.61 0.098
Inference questions only
Hinze & Rapp, 2014 1 Passages N 168 CR MC Low-stakes quiz Rereading Inference 21 21 0.53 0.10
1 Passages N 168 CR MC High-stakes quiz* Rereading* Inference 19 21 -0.30 0.11
Tran et al., 2014 1 Premises Y 0 CR MC RP Reread Final test 68 0.16 0.015
2 Premises Y 48 CR MC RP Reread Final test 40 -0.06 0.026
3 Premises Y 0 CR MC RP Reread Final test 84 -0.21 0.012
3 Premises Y 48 CR MC RP Reread Final test 84 -0.05 0.012
4 Premises Y 48 CR MC RP Reread Inference 84 -0.19 0.012
Lechuga et al., 2015 Passage Y (R) 168 FR CR Repeated retrieval Rpt. study Inference 20 20 0.15 0.11
Eglington & Kang, 2016 App. Premises Y 49.5 CR MC RP Restudy Final test 45 0.07 0.023
1 Premises Y(D) 49.5 CR MC RP Restudy Final test 56 0.38 0.020
2 Premises Y(D) 49.5 CR MC RP, simultaneous Restudy Final test 64 0.35 0.017
2 Premises Y 49.5 CR MC RP, single Restudy Final test 64 -0.09 0.016
3 Premises Y(D) 49.5 FR MC RP Restudy Final test 45 0.38 0.025
Application and
inference questions
Johnson & Mayer, 2009 Anim. N 0.083 CR CR Practice-retention* Restudy* New qs. 53 53 -0.46 0.040
Anim. N 168 CR CR Practice-retention* Restudy* New qs. 59 59 -0.23 0.035
Anim. N 0.083 CR CR Practice-transfer Restudy New qs. 27 53 0.19 0.058
Anim. N 168 CR CR Practice-transfer Restudy New qs. 31 59 0.68 0.078
McDaniel et al., 2009 1 Passages Y (R) 0 FR CR 3R Rereading Inference 24 24 -0.07 0.087
1 Passages Y (R) 168 FR CR 3R Rereading Inference 24 24 0.26 0.088
2 Passages Y (R) 0 FR CR 3R Rereading Prob.-solving 24 24 0.65 0.092
2 Passages Y (R) 168 FR CR 3R Rereading Prob.-solving 24 24 0.65 0.092
Butler, 2010 1b Passages Y 168 CR CR Same test Re-study passages Factual inf. 24 1.17 0.081
1b Passages Y 168 CR CR Variable test* Re-study passages* Factual inf. 24 0.89 0.066
1b Passages Y (E) 168 CR CR Same test* Re-study passages* Conceptual inf. 24 0.91 0.067
1b Passages Y (E) 168 CR CR Variable test Re-study passages Conceptual inf. 24 1.04 0.074
2 Passages Y 168 CR CR Same test* Re-study sentences* Factual inf. 24 0.66 0.057
2 Passages Y (E) 168 CR CR Variable test Re-study sentences Conceptual inf. 24 0.53 0.053
3 Passages Y (E) 168 CR CR Same test Re-study passages Conceptual inf. 20 1.17 0.10
Agarwal, 2011 1 Passages Y 48 MC MC Concept quiz Study twice Higher order 48 -0.11 0.022
Karpicke & Blunt, 2011 1 Passage Y (R) 168 FR CR RP Rpt. study Inference qs. 20 20 0.50 0.11
Hinze et al., 2013 2 Passages N 168 CR MC Paragraph recall Reread Inference 25 26 0.53 0.085
2 Passages N 168 CR MC Expect detail* Reread* Inference 23 26 0.10 0.086
2 Passages N 168 CR MC Expect inference* Reread* Inference 23 26 0.68 0.091
3 Passages N 168 FR MC Free recall* Reread* Inference 31 31 0.19 0.067
3 Passages N 168 CR MC Explain (BE) Reread Inference 31 31 0.49 0.069
Zhou et al., 2013 1 Passage Y (R) 168 FR CR RP Restudy Inference 40 40 0.22 0.052
2 Passage Y (R) 168 FR CR RP Concept mapping Inference 43 0.32 0.026
Blunt & Karpicke, 2014 2 Passage Y (R) 168 FR CR RP, paragraph Rpt. study, paragraph Inference 20 20 0.58 0.11
2 Passage Y (R) 168 FR CR RP, concept map Rpt. study, concept map Inference 20 20 0.35 0.11
Nguyen & McDaniel, 2016 1 Passages Y (R) 0.37 FR MC Standard 3R Meta-notetaking Inference 32 32 -0.27 0.065
1 Passages Y (R) 0.37 FR MC Meta-3R* Meta-notetaking* Inference 32 32 0.46 0.066
van Eersel et al., 2016 1 Passages Y (E) 0.083 CR CR Testing Rereading Final test 24 1.23 0.085
1 Passages Y (E) 168 CR CR Testing Rereading Final test 24 1.08 0.076
2 Passages Y (E) 168 CR CR Testing Reread-plus-feedback Final test 54 0.42 0.021
Problem-solving
skills
Medical diagnosis
and treatment
Kromann et al., 2009 Course Y** 336 CST CST Intervention Control Final test 41 40 0.93 0.056
Kromann et al., 2010 Course Y** 4320 CST CST Intervention Control Final test 48 41 0.40 0.047
Kromann et al., 2011 Course Y** 336 CST CST Intervention Control Final test 66 72 0.55 0.031
Larsen et al., 2013a Course Y (D) 4320 SA CR Test, no self-expl. Study, no self-expl. Final test 49 0.70 0.027
Course Y (D) 4320 SA CR Test, self-expl. (BE) Study, self-expl. Final test 49 0.48 0.024
Larsen et al., 2013b Proced. Y (D) 4320 SPT SPT Standardized patient test Review SPT 41 0.80 0.034
Proced. Y (D) 4320 SA SPT Written test* Review* SPT 41 0.33 0.027
Worked examples
Darabi et al., 2007 Probs. N 342 PST PST Problem group Product group Transfer test 22 22 0.96 0.11
van Gog et al., 2011 Probs. N 0 PST PST Example-problem Example-example Final test 22 26 -0.02 0.088
van Gog & Kester, 2012 Probs. N 0.083 PST PST STST-T-T SSSS-T-T Final test 20 20 -0.25 0.11
Leahy et al., 2015 1 Probs. Y 0 PST PST Worked examples+prob. Worked examples only Final test 17 16 -0.73 0.14
2 Probs. Y 0 PST PST Worked examples+prob. Worked examples only Final test 24 25 -0.68 0.091
3 Probs. Y 168 PST PST Worked examples+prob. Worked examples only Final test 14 13 0.05 0.16
van Gog et al., 2015 1 Probs. N 0.083 PST PST EPEP, isomorphic EEEE, isomorphic Final test 20 20 0.24 0.11
1 Probs. N 168 PST PST EPEP, isomorphic* EEEE, isomorphic* Final test 20 20 0.04 0.11
1 Probs. N 0.083 FR PST ERER, isomorphic* EEEE, isomorphic* Final test 20 20 -0.10 0.11
1 Probs. N 168 FR PST ERER, isomorphic EEEE, isomorphic Final test 20 20 0.22 0.11
2 Probs. N 0.083 PST PST EPEP EEEE Isomorphic 15 15 0.21 0.14
2 Probs. N 168 PST PST EPEP* EEEE* Isomorphic 15 15 -0.08 0.14
3 Probs. N 168 PST PST EEEE-PP-isomorphic EEEE-EE-isomorphic Final test 29 34 0.07 0.066
3 Probs. N 168 PST PST EEEE-PP-identical EEEE-EE-identical Final test 33 33 0.04 0.063
4 Probs. N 168 PST PST Testing Restudy Final test 30 27 0.34 0.074
Mediator and
related word cues
Mediator cues
Carpenter, 2011 2 PAL N 0.5 CR CR Test Study Semantic mediator 10 10 1.84 0.34
Rawson et al., 2015 1 PAL N 48 CR CR Test Restudy Mediator, short lag 33 31 1.19 0.077
1 PAL N 48 CR CR Test Restudy Mediator, long lag 31 30 1.25 0.082
2 PAL N 48 CR CR STSTST Restudy Mediator, short lag 50 40 0.67 0.049
2 PAL N 48 CR CR STSTST* Restudy* Mediator, long lag 53 45 1.06 0.048
2 PAL N 48 CR CR SSSTTT* Restudy* Mediator, short lag 45 40 0.38 0.049
2 PAL N 48 CR CR SSSTTT Restudy Mediator, long lag 51 45 0.95 0.048
App. PAL N 0.33 CR CR STT Restudy Mediator, short lag 32 32 0.71 0.069
App. PAL N 0.33 CR CR STT Restudy Mediator, long lag 32 32 0.57 0.067
Coppens et al., 2016 1 PAL N 0.5 CR CR Testing, no-mc Restudy Mediator cues 17 28 0.52 0.10
1 PAL N 0.5 CR CR Testing, strong-mc Restudy Mediator cues 23 26 0.24 0.086
2 PAL N 0.5 CR CR Testing Restudy Mediator cues 47 44 0.42 0.046
3 PAL N 0.5 CR CR Testing Restudy Mediator cues 26 30 0.94 0.083
Cho et al., 2017 1 PAL N 0.5 CR CR Test Restudy .20 M-C Pairs, mediator 38 38 0.50 0.056
1 PAL N 0.5 CR CR Test Restudy .00 M-C Pairs, mediator 38 38 0.89 0.060
Related cues
Carpenter, 2011 2 PAL N 0.5 CR CR Test Study New related 10 10 0.65 0.24
Veltre et al., 2013 Words Y 48 CR CR Semantic/different-type Restudy Final test 48 48 0.54 0.044
Rawson et al., 2015 1 PAL N 48 CR CR Test Restudy Related, short lag 33 31 0.42 0.066
1 PAL N 48 CR CR Test Restudy Related, long lag 31 30 1.18 0.08
2 PAL N 48 CR CR STSTST Restudy Related, short lag 50 40 0.48 0.047
2 PAL N 48 CR CR STSTST* Restudy* Related, long lag 53 45 0.32 0.043
2 PAL N 48 CR CR SSSTTT* Restudy* Related, short lag 45 40 0.26 0.049
2 PAL N 48 CR CR SSSTTT Restudy Related, long lag 51 45 0.57 0.045
App. PAL N 0.33 CR CR STT Restudy Related, short lag 32 32 0.37 0.066
App. PAL N 0.33 CR CR STT Restudy Related, long lag 32 32 0.43 0.066
Coppens et al., 2016 1 PAL N 0.5 CR CR Testing, no-mc Restudy Related cues 21 22 0.26 0.099
1 PAL N 0.5 CR CR Testing, strong-mc Restudy Related cues 30 21 1.23 0.10
2 PAL N 0.5 CR CR Testing Restudy Related cues 36 38 0.49 0.057
3 PAL N 0.5 CR CR Testing Restudy Related cues 31 29 -0.04 0.069
Cho et al., 2017 1 PAL N 0.5 CR CR Test Restudy .20 M-C Pairs, related 41 41 0.32 0.051
1 PAL N 0.5 CR CR Test Restudy .00 M-C Pairs, related 41 41 0.08 0.050
Note. Under Stimuli, PAL = Paired associates; CAT = Categories. Under Feedback (Correct Answer Feedback), Y = Yes and N = No; (R) = Restudy; (D) =
Extended and/or detailed feedback; (E) = Explanatory feedback. Under Train and Final (Test format), FR = Free recall, CR = Cued recall, MC = Multiple-choice,
Recog. = Recognition, CST = Clinical scenario test, SPT = Simulated patient test, PST = Problem-solving test. RP = Retrieval practice, Proced. = Procedures, Probs.
= Problems, qs. = Questions, Self-expl. = Self-explanations. NT and NR = sample sizes in the initially tested and reexposure control conditions, respectively. Under
Initial test condition, (BE) = Broad encoding methods. Other abbreviations are drawn verbatim from, and defined in, the original articles. (*) indicates data that were
randomly excluded from the meta-analyses due to non-independence with other included conditions; (**) indicates that feedback was provided but did not qualify as
correct answer feedback; (‡) indicates data reported collapsed across conditions. Experiment numbers of unpublished studies may differ from those in subsequently
published articles.
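As a point of reference for the d and sv columns, the conventional between-subjects formulas are sketched below; the authors' exact estimators (particularly for within-subjects comparisons) may differ, and the summary statistics used here are made up.

```python
# Sketch: conventional between-subjects Cohen's d and its large-sample sampling
# variance (the sv column). The authors' exact estimators, especially for
# within-subjects designs, may differ.
import math

def cohens_d(mean_t, mean_r, sd_t, sd_r, n_t, n_r):
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_r - 1) * sd_r ** 2) /
                          (n_t + n_r - 2))
    return (mean_t - mean_r) / pooled_sd

def sampling_variance(d, n_t, n_r):
    return (n_t + n_r) / (n_t * n_r) + d ** 2 / (2 * (n_t + n_r))

d = cohens_d(0.72, 0.60, 0.20, 0.22, 20, 20)   # made-up summary statistics
print(round(d, 2), round(sampling_variance(d, 20, 20), 3))
```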
Table 2.
Overall and Category-Level Weighted Mean Effect Sizes
Category and dataset (number of effect sizes) β SE df p 95% C.I.
Overall across all categories (k = 192) 0.40 0.046 43.016 <.00001 [0.31,0.50]
Initial test performance data available (k = 135) 0.41 0.054 29.01 <.00001 [0.30,0.52]
Test format (k = 56) 0.58 0.071 20.86 <.00001 [0.43,0.73]
Stimulus-response rearrangement (k = 33) 0.22 0.098 6.46 .066 [-0.019,0.45]
Untested materials seen during initial study (k = 17) 0.16 0.12 10.086 .20 [-0.10,0.43]
Application and inference questions (k = 41) 0.33 0.11 10.39 .013 [0.085,0.56]
Problem-solving skills (k = 18) 0.29 0.15 5.46 .10 [-0.078,0.65]
Mediator and related word cues (k = 27) 0.61 0.089 2.12 .018 [0.25,0.97]
Note. β = regression coefficient in terms of Cohen’s d; SE = standard error; df = adjusted degrees of freedom; C.I. = confidence interval.
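To illustrate how a weighted mean effect size of this kind is formed, a DerSimonian-Laird random-effects sketch is given below. It is deliberately simplified: the estimates in Table 2 were obtained with robust variance estimation for dependent effect sizes (Tanner-Smith & Tipton, 2014), which this sketch does not reproduce, and the data shown are hypothetical.

```python
# Sketch: DerSimonian-Laird random-effects weighted mean. A simplified stand-in
# for the robust variance estimation used for Table 2, shown only to illustrate
# inverse-variance weighting. Data are hypothetical.
import numpy as np

def dersimonian_laird(d, sv):
    w = 1.0 / sv                                  # fixed-effect weights
    mu_fixed = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - mu_fixed) ** 2)           # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)       # between-study variance
    w_star = 1.0 / (sv + tau2)                    # random-effects weights
    mu = np.sum(w_star * d) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return mu, se

d  = np.array([0.40, 0.15, 0.80, 0.55])           # hypothetical effect sizes
sv = np.array([0.02, 0.05, 0.03, 0.08])           # hypothetical sampling variances
mu, se = dersimonian_laird(d, sv)
print(f"weighted mean d = {mu:.2f}, 95% CI [{mu - 1.96 * se:.2f}, {mu + 1.96 * se:.2f}]")
```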
Table 3.
Overall Random-Effects Meta-Analyses Results
Analysis type Dataset Model type Moderator variable/intercept β SE df p 95% C.I.
Single moderator fits: All effect sizes (k = 192)
Response congruency 0.30 0.081 35.55 .0006 [0.14, 0.47]
Single moderator fits: Initial test performance data available (k = 135)
Initial test performance 0.82 0.16 15.30 .0001 [0.47, 1.17]
No. training repetitions 0.13 0.032 13.75 .0011 [0.06, 0.20]
Response congruency 0.29 0.094 26.68 .0048 [0.096, 0.48]
Simultaneous moderator fits: All effect sizes (k = 192); main effect of all moderators, except initial test performance
Elaborated retrieval practice 0.22 0.077 23.56 .0094 [0.059, 0.38]
Response congruency 0.35 0.082 35.37 .0002 [0.18, 0.51]
Intercept 0.21 0.064 21.11 .0031 [0.081, 0.35]
Simultaneous moderator fits: Initial test performance data available (k = 135); main effect of all moderators, including initial test performance
Initial test performance 0.58 0.17 15.96 .0029 [0.23, 0.93]
Elaborated retrieval practice 0.23 0.085 17.91 .015 [0.050, 0.41]
Response congruency 0.26 0.086 23.08 .0058 [0.084, 0.44]
Intercept -0.16 0.10 12.65 .13 [-0.38, 0.58]
Note. β = regression coefficient in terms of Cohen’s d; SE = standard error; df = adjusted degrees of freedom; C.I. = confidence interval; No. = number. The intercept is reported for all simultaneous moderator fits.
Table 4.
Overall PEESE Analyses Results
Analysis type Dataset Moderator variable/intercept β t p
No moderators fitted
All effect sizes (k = 192)
Sampling variability 4.41 4.51 <.0001
Intercept 0.17 3.69 .0003
Initial test performance data available (k = 135)
Sampling variability 5.61 4.48 <.0001
Intercept 0.013 2.41 .017
With moderators fitted
All effect sizes (k = 192)
Sampling variability 3.86 4.32 <.0001
Elaborated retrieval practice 0.18 2.65 .0088
Response congruency 0.36 6.89 <.0001
Intercept 0.015 0.33 .74
Initial test performance data available (k = 135)
Sampling variability 4.53 3.89 .0002
Initial test performance 0.50 3.39 .0009
Elaborated retrieval practice 0.14 1.64 .10
Response congruency 0.25 4.16 <.0001
Intercept -0.30 -2.89 .0045
Table 5.
Overall Effect Size Estimates for Various Publication Bias Scenarios
Dataset Moderator variable/intercept Unadjusted estimate Moderate one-tailed Severe one-tailed Moderate two-tailed Severe two-tailed
All effect sizes (k = 192)
Elaborated retrieval practice 0.22 0.22 0.27 0.20 0.17
Response congruency 0.35 0.37 0.48 0.33 0.31
Intercept 0.21 0.12 -0.12 0.17 0.12
Initial test performance data available (k = 135)
Initial test performance 0.58 0.60 0.70 0.53 0.46
Elaborated retrieval practice 0.23 0.24 0.27 0.21 0.19
Response congruency 0.26 0.28 0.35 0.25 0.24
Intercept -0.16 -0.26 -0.53 -0.17 -0.17
Note. Effect size estimates are in terms of Cohen’s d. All effect size estimates were derived using the selection
methods detailed in Vevea and Woods (2005) with the suggested p-value cutoffs of .001, .01, .05, and .50 (and,
for two-tailed selection methods, also .95, .99, and .999). Investigated moderators followed those of the random-
effects meta-analyses.
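The selection-method scenarios in Tables 5 and 8 assume that statistically significant results are more likely to be published than nonsignificant ones. The sketch below conveys that logic with a crude inverse-probability-of-selection reweighting; it is not the maximum-likelihood selection model of Vevea and Woods (2005) that produced the published estimates, and both the data and the selection probabilities shown are illustrative.

```python
# Sketch: crude inverse-probability-of-selection reweighting to illustrate the
# logic of the publication bias scenarios. Not the Vevea-Woods (2005) maximum-
# likelihood selection model; data and selection probabilities are illustrative.
import numpy as np
from scipy import stats

cutpoints = [0.001, 0.01, 0.05, 0.50, 1.01]       # one-tailed p-value intervals
selection_probs = [1.00, 0.95, 0.80, 0.60, 0.50]  # assumed chance of publication

d  = np.array([0.40, 0.15, 0.80, 0.55])           # hypothetical effect sizes
sv = np.array([0.02, 0.05, 0.03, 0.08])           # hypothetical sampling variances

p_one_tailed = 1 - stats.norm.cdf(d / np.sqrt(sv))
sel = np.array([selection_probs[np.searchsorted(cutpoints, p)] for p in p_one_tailed])

# Up-weight studies that, under the scenario, were less likely to be published
w = (1.0 / sv) / sel
print(f"scenario-adjusted weighted mean d = {np.sum(w * d) / np.sum(w):.2f}")
```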
Table 6.
Category-Level Random-Effects Meta-Analyses Results
Category Analysis type Dataset Moderator variable/intercept β SE df p 95% C.I.
Test format
Single moderator fits: All effect sizes (k = 56)
Between- vs. within-subjects design 0.35 0.11 12.63 .0062 [0.12, 0.58]
Simultaneous moderator fits: All effect sizes (k = 56)
Between- vs. within-subjects design 0.35 0.11 12.63 .0062 [0.12, 0.58]
Intercept 0.49 0.077 15.20 <.0001 [0.33, 0.66]
Simultaneous moderator fits: Initial test performance data available (k = 46)
Between- vs. within-subjects design 0.40 0.18 9.23 .048 [0.42, 0.80]
Initial test performance 0.86 0.29 9.25 .016 [0.20, 1.51]
Multiple-choice vs. not on the initial test -0.48 0.16 5.73 .024 [-0.88, -0.093]
Response congruency 0.29 0.11 6.53 .039 [0.019, 0.56]
Intercept -0.27 0.25 7.10 .33 [-0.87, 0.33]
Stimulus-response rearrangement
Single moderator fits: All effect sizes (k = 33)
Paired associates vs. non-paired associates 0.66 0.053 2.41 .0031* [0.46, 0.86]
Simultaneous moderator fits: All effect sizes (k = 33)
Paired associates vs. non-paired associates 0.66 0.053 2.41 .0031* [0.46, 0.86]
Intercept 0.063 0.054 4.62 .29 [-0.078, 0.20]
Untested materials seen during initial study
Single moderator fits: All effect sizes (k = 17)
Elaborated retrieval practice 0.37 0.14 7.87 .032 [0.041, 0.70]
Simultaneous moderator fits: All effect sizes (k = 17)
Elaborated retrieval practice 0.37 0.14 7.87 .032 [0.041, 0.70]
Intercept 0.0028 0.13 7.31 .98 [-0.29, 0.29]
Application and inference questions
Single moderator fits: All effect sizes (k = 41)
No. training repetitions 0.33 0.072 3.31 .016* [0.11, 0.55]
Retention interval 0.0033 0.0009 9.64 .0060 [0.0012, 0.0054]
Elaborated retrieval practice 0.35 0.13 7.97 .029 [0.046, 0.66]
Simultaneous moderator fits: All effect sizes (k = 41)
Correct answer feedback -0.49 0.074 2.62 .011* [-0.75, -0.24]
No. training repetitions 0.29 0.098 3.27 .054* [-0.095, 0.58]
Elaborated retrieval practice 0.26 0.11 5.54 .063 [-0.019, 0.54]
Intercept 0.44 0.062 2.02 .019 [0.18, 0.71]
Problem-solving skills
Single moderator fits: All effect sizes (k = 18)
Worked examples vs. medical diagnosis and treatment 0.59 0.20 5.34 .028 [0.093, 1.09]
Simultaneous moderator fits: All effect sizes (k = 18)
Worked examples vs. medical diagnosis and treatment 0.59 0.20 5.34 .028 [0.093, 1.09]
Intercept 0.45 0.18 2.17 .83 [-0.69, 0.78]
Note. β = regression coefficient in terms of Cohen’s d; SE = standard error; df = adjusted degrees of freedom; C.I. = confidence interval; No. = number. An asterisk indicates that the p-value may be untrustworthy due to insufficient degrees of freedom (< 4). The intercept is reported for all simultaneous moderator fits. No single or simultaneous moderator fits were performed for the category of transfer to mediator and related word cues, owing to too few studies currently available in that category.
Table 7.
Category-Level PEESE Analyses Results
Analysis type Category Dataset Moderator variable/intercept β t p
No moderators
fitted
Test format
All effect sizes (k = 56) Sampling variability 5.06 2.35 .023
Intercept 0.36 4.14 <.0001
Initial test performance data available (k = 46) Sampling variability 6.74 2.71 .0096
Intercept 0.25 2.71 .0096
Stimulus-response
rearrangement
All effect sizes (k = 33) Sampling variability 10.93 3.30 .0025
Intercept -0.12 -1.23 .2265
Untested materials
seen during initial
study
All effect sizes (k = 17) Sampling variability 6.91 2.75 .0148
Intercept -0.26 -1.91 .076
Application and
inference questions
All effect sizes (k = 41) Sampling variability 6.43 4.27 <.0001
Intercept -0.045 -0.61 .55
Problem-solving
skills
All effect sizes (k = 18) Sampling variability -6.56 -2.92 .010
Intercept 0.78 4.97 <.0001
Mediator and
related word cues
All effect sizes (k = 27) Sampling variability 3.55 1.90 .070
Intercept 0.36 2.58 .016
With moderators
fitted
Test format
All effect sizes (k = 56) Sampling variability 3.19 1.07 .29
Between- vs. within-subjects design 0.20 0.90 .37
Intercept 0.39 4.16 <.0001
Initial test performance data available (k = 46) Sampling variability 6.23 2.33 .025
Between- vs. within-subjects design 0.14 0.63 .53
Initial test performance 0.97 3.45 .0013
Multiple-choice vs. not on the initial test -0.44 -3.46 .0013
Response congruency 0.26 2.06 .046
Intercept -0.55 -2.04 .048
Stimulus-response
rearrangement
All effect sizes (k = 33) Sampling variability 4.35 1.93 .064
Paired associates vs. non-paired associates 0.59 7.1 <.0001
Intercept -0.044 -0.70 .49
Untested materials
seen during initial
study
All effect sizes (k = 17) Sampling variability 5.68 2.46 .028
Elaborated retrieval practice 0.34 2.18 .047
Intercept -0.32 -2.54 .023
Application and
inference questions
All effect sizes (k = 41) Sampling variability 1.94 1.34 .11
Correct answer feedback -0.38 -2.37 .024
No. training repetitions 0.23 3.66 .0008
Elaborated retrieval practice 0.26 2.98 .0051
Intercept 0.28 1.63 .11
Problem-solving
skills
All effect sizes (k = 18) Sampling variability -0.039 -0.01 .99
Worked examples vs. medical diagnosis and treatment 0.57 2.02 .061
Intercept 0.051 0.13 .90
Note. β = regression coefficient in terms of Cohen’s d. The order of the category-level PEESE analyses, the number
of analyses, and the moderators investigated in these analyses followed those used in the random-effects meta-
analyses.
Table 8.
Category-Level Effect Size Estimates for Various Publication Bias Scenarios
Category Dataset Moderator variable/intercept Unadjusted estimate Moderate one-tailed Severe one-tailed Moderate two-tailed Severe two-tailed
Test format
All effect sizes (k = 56)
Between- vs. within-subjects design 0.35 0.36 0.43 0.34 0.31
Intercept 0.49 0.41 0.20 0.44 0.37
Initial test performance data available (k = 46)
Between- vs. within-subjects design 0.40 0.42 0.46 0.40 0.38
Initial test performance 0.86 0.98 1.20 0.89 0.91
Multiple-choice vs. not on the initial test -0.48 -0.54 -0.66 -0.48 -0.46
Response congruency 0.29 0.33 0.39 0.30 0.30
Intercept -0.28 -0.45 -0.75 -0.34 -0.41
Stimulus-response rearrangement
All effect sizes (k = 33)
Paired associates vs. non-paired associates 0.66 0.68 0.74 0.66 0.67
Intercept 0.062 0.036 -0.036 0.052 0.036
Untested materials seen during initial study
All effect sizes (k = 17)
Elaborated retrieval practice 0.38 0.37 0.39 0.32 0.25
Intercept -0.0041 -0.071 -0.21 -0.014 -0.025
Application and inference questions
All effect sizes (k = 41)
Correct answer feedback -0.49 -0.46 -0.45 -0.43 -0.35
No. training repetitions 0.27 0.27 0.29 0.26 0.24
Elaborated retrieval practice 0.28 0.26 0.25 0.25 0.20
Intercept 0.43 0.38 0.30 0.39 0.31
Problem-solving skills
All effect sizes (k = 18)
Worked examples vs. medical diagnosis and treatment 0.58 0.62 0.77 0.58 0.58
Intercept 0.047 -0.0049 -0.17 0.039 0.027
Mediator and related word cues
All effect sizes (k = 27)
Intercept 0.61 0.55 0.43 0.55 0.46
Note. Effect size estimates are in terms of Cohen’s d. All effect size estimates were derived using the selection
methods detailed in Vevea and Woods (2005) with the suggested p-value cutoffs of .001, .01, .05, and .50 (and,
for two-tailed selection methods, also .95, .99, and .999). Investigated moderators followed those of the random-
effects meta-analyses.