Team Science

Kirkpatrick's Hierarchy Applied to Team Training

The most widely used framework for evaluating training is the four-level hierarchy proposed by Donald Kirkpatrick in the late 1950s and refined repeatedly since (Kirkpatrick & Kirkpatrick, 2016). The four levels (Reaction, Learning, Behavior, and Results) have become the default vocabulary of training evaluation across industries, from corporate learning and development to military training to medical education. Practitioners who design team training inherit this framework whether they intend to or not, because most sponsors expect evaluation to be reported in its terms.

Kirkpatrick’s framework, however, was developed for individual training. Applying it to team training without adaptation produces evaluations that are technically Kirkpatrick-compliant but substantively misleading. The framework is still useful for team training; the levels still map onto meaningful evaluation questions. The necessary adaptation is to recognize that each level has both an individual and a team manifestation, and that evaluating team training requires the team-level read at each level, not the individual-level read alone.

This post walks through the four levels, names what each looks like at the team level, and identifies the most common mistakes practitioners make when adapting Kirkpatrick to team training contexts.

Level 1: Reaction

Level 1 measures how participants reacted to the training: did they find it useful, engaging, relevant, well-facilitated. The familiar instrument is the post-training survey, often called a “smile sheet” because it tends to capture warmth toward the experience more than evidence about its impact.

For team training, Level 1 has both an individual and a team meaning. The individual reaction question is the standard one: did each participant find the experience useful. The team-level reaction question is more interesting and more diagnostic: did the team, as a unit, perceive that what it learned together was relevant to the work it does together. The team-level question is rarely asked. It typically requires either a brief team debrief at the close of training or a short instrument that asks members to rate the relevance of the training to the team’s work rather than to their own learning.

Level 1 is necessary but quickly insufficient. Alliger et al. (1997) demonstrated that reaction measures correlate weakly, and sometimes negatively, with later training transfer. A program that participants enjoyed is not, on that fact alone, a program that produced behavior change. Stopping at Level 1 is the most common evaluation mistake in team training.

Level 2: Learning

Level 2 measures what participants learned: declarative knowledge, procedural skill, and (in modern adaptations) attitudinal change. The familiar instrument is the pre-test/post-test, sometimes augmented by a structured skill demonstration.

For team training, Level 2 splits cleanly. The individual learning question asks whether each member acquired the targeted attitudes, behaviors, and cognitions. The team-level learning question asks whether the team, as a unit, developed the shared mental models, transactive memory, and collective efficacy that team training is meant to build. Individual gains do not aggregate automatically into team-level gains. A team can score well on individual measures and still have poorly aligned shared mental models, because aligned mental models are a property of the team, not a sum of individual scores.
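To make that distinction concrete, here is a minimal sketch of one simple convergence check: correlating members’ similarity-rating profiles over pairs of task concepts. The concepts, the ratings, and the choice of profile correlation as the index are all illustrative assumptions; the literature uses several elicitation and comparison methods, and this is only one.

```python
from statistics import correlation
from itertools import combinations

# Hypothetical similarity ratings (1-5) over all pairs of task concepts,
# elicited from each member separately. Alignment is a property of a pair
# of members, not of either member's individual score.
concepts = ["triage", "handoff", "escalation", "debrief"]
pairs = list(combinations(concepts, 2))  # 6 concept pairs

member_1 = [5, 2, 1, 4, 2, 3]  # one rating per concept pair, in `pairs` order
member_2 = [4, 2, 1, 5, 1, 3]  # closely aligned with member_1
member_3 = [1, 5, 4, 2, 5, 2]  # a very different picture of the task

# Correlate rating profiles pairwise: high r suggests convergent mental models.
print(correlation(member_1, member_2))  # ~0.89 -> shared model
print(correlation(member_1, member_3))  # ~-0.85 -> misaligned
```

The numbers illustrate the point in the paragraph above: member 3 could score well on an individual knowledge test and still hold a task model that diverges sharply from the rest of the team.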

Specific instruments at this level include knowledge assessments, behavioral skill demonstrations, mental-model elicitation methods (concept mapping, similarity ratings of task concepts), and team-level surveys of collective efficacy and psychological safety. Team-level Level 2 evaluation requires aggregation rules: practitioners need evidence that members agree enough for the mean to be interpretable as a team-level property (Mathieu et al., 2008). When agreement is low, the team has not yet achieved shared cognition, regardless of the average score.
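Because the aggregation question arises for every team-level survey measure, a worked example helps. The sketch below computes the rwg within-group agreement index for a single Likert item, assuming the conventional uniform null distribution; the 0.70 cutoff in the comments is a common convention, not a statistical law.

```python
from statistics import variance

def rwg(ratings, scale_points):
    """Within-group agreement (rwg) for a single survey item.

    Compares the observed variance of members' ratings to the variance
    expected if members responded at random (uniform null). Values near
    1.0 indicate agreement; a common convention is to require
    rwg >= 0.70 before interpreting the team mean as a team property.
    """
    observed = variance(ratings)          # sample variance across members
    null = (scale_points ** 2 - 1) / 12   # variance of a uniform 1..A response distribution
    return max(0.0, 1 - observed / null)  # negative values are conventionally floored at 0

# Example: five members rate a shared-cognition item on a 1-5 scale.
team_a = [4, 4, 5, 4, 4]   # members agree: the mean (4.2) is interpretable as a team score
team_b = [1, 5, 2, 5, 3]   # members disagree: the mean (3.2) masks a split team

print(rwg(team_a, scale_points=5))  # ~0.90 -> safe to aggregate
print(rwg(team_b, scale_points=5))  # 0.0   -> no shared team-level property yet
```

Note that both teams have respectable means; only the agreement check reveals that one of those means is uninterpretable as a team-level score.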

Level 3: Behavior

Level 3 is the level that matters most for team training and the level that is most often skipped. It measures whether the trained behavior is actually used in the workplace after training ends. For team training, this means observing the team performing real or high-fidelity simulated work and scoring its behavior against the same competencies the training targeted.

Level 3 is hard. It requires trained observers, a validated behavioral marker tool, scheduled access to team work, and a comparison cadence that includes pre-training baseline, immediate post-training, and delayed observation (typically 90 days). The combination of these requirements means Level 3 is also expensive, and it is often the level sponsors try to defer or replace with a survey-based proxy. Practitioners who allow that substitution end up with no defensible evidence that the training produced behavioral change.
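As a small illustration of what the comparison cadence produces, the sketch below tabulates hypothetical observer ratings across the three windows. The behavioral markers, the 1-5 scale, and the scores are all invented for the example; a real Level 3 design would also track inter-observer reliability.

```python
# Hypothetical behavioral-marker ratings (1-5) averaged across trained observers.
# Each window scores the team against the same competencies the training targeted.
observations = {
    "baseline": {"closed_loop_communication": 2.1, "mutual_monitoring": 2.4, "backup_behavior": 1.9},
    "post":     {"closed_loop_communication": 3.8, "mutual_monitoring": 3.5, "backup_behavior": 3.2},
    "delayed":  {"closed_loop_communication": 3.4, "mutual_monitoring": 3.3, "backup_behavior": 3.0},  # ~90 days
}

def change_from_baseline(obs, window):
    """Per-marker change relative to the pre-training baseline window."""
    return {marker: round(obs[window][marker] - obs["baseline"][marker], 2)
            for marker in obs["baseline"]}

print(change_from_baseline(observations, "post"))     # immediate gain after training
print(change_from_baseline(observations, "delayed"))  # how much of the gain persisted
```

The delayed read is the one that matters: the gap between the post and delayed rows is the decay the 90-day observation exists to detect.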

This level deserves a deeper treatment than space allows here; a separate post on this site walks through how to design and run a Level 3 evaluation in practice. The summary point for the present discussion is that no team training program should be considered complete unless the practitioner and sponsor have agreed, before training begins, on what Level 3 will look like.

Level 4: Results

Level 4 measures the organizational outcomes the training was meant to influence: in healthcare, patient safety indicators; in operations, error rates and rework; in product organizations, on-time delivery and quality measures; in executive teams, the quality and timeliness of strategic decisions. Level 4 is rarely measurable cleanly because organizational outcomes are influenced by many variables in addition to training, and isolating the training’s contribution typically requires either a comparison cohort or a sustained pre/post measurement window.

For team training, Level 4 has the cleanest interpretation when the team’s work has direct, measurable team-level outcomes. Surgical teams have surgical complications and turnover times. Resuscitation teams have time-to-defibrillation and survival rates. Sales teams have win rates on multi-stakeholder deals. The closer the team’s outcomes are to its own performance (rather than to broader organizational outcomes that include many teams), the more interpretable the Level 4 read.

Holton (1996) and others have argued that Kirkpatrick’s Level 4 conflates multiple distinct constructs, and that practitioners would benefit from explicit measurement models that separate training effects from confounds. For team training in particular, a useful refinement is to distinguish team performance outcomes (the team’s own measurable output) from organizational outcomes (the broader business indicators the team is one of many contributors to). The first is more attributable; the second is more meaningful but harder to defend.
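One way to act on that refinement, when a comparison cohort is available, is a simple difference-in-differences contrast. This is an illustrative sketch with invented numbers, not a method Holton (1996) prescribes, and it rests on the assumption that trained and untrained teams would have trended together absent the training.

```python
# Difference-in-differences sketch for a Level 4 read with a comparison cohort.
# All figures are hypothetical team-level outcomes (errors per 100 cases).
trained_pre, trained_post = 12.0, 8.5    # teams that received the training
control_pre, control_post = 11.8, 11.2   # comparable teams that did not

trained_change = trained_post - trained_pre     # -3.5: training plus everything else
control_change = control_post - control_pre     # -0.6: everything else alone
did_estimate = trained_change - control_change  # -2.9: change attributable to training

print(f"Training-attributable change: {did_estimate:+.1f} errors per 100 cases")
```

The control cohort's change (-0.6) is exactly the confound the raw pre/post comparison would have misattributed to the training.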

Common Adaptation Errors

Three errors recur in team training evaluations that nominally use Kirkpatrick.

Reporting individual-level data as team-level evidence. Practitioners measure individual learning gains, average them across members, and report the average as evidence of team-level change. This conflates the level of measurement with the level of analysis. Average individual change is not equivalent to team change, particularly for cognition and attitude variables where the team-level construct is shared agreement, not summed individual scores.

Stopping at Level 2. A program that demonstrates individual learning gains but no behavioral or team-level effects is not a successful team training program. Sponsors often accept Level 2 as sufficient because it is what most evaluations report, but the science is unambiguous that team-level outcomes require Level 3 evidence.

Treating Level 1 as a marker of program quality. A high reaction score is not evidence the training worked; it is evidence the participants enjoyed the experience. The two are weakly correlated at best (Alliger et al., 1997), and using Level 1 as the primary evaluation moves the practitioner away from defensible evidence and toward customer-satisfaction reporting.

Using Kirkpatrick Well

Used carefully, Kirkpatrick’s framework is still serviceable for team training evaluation. The key adaptations are to define both an individual-level and a team-level question at each of the four levels, to invest disproportionately in Level 3 because that is where behavior change shows up, and to be conservative about Level 4 claims unless the team’s outcomes are cleanly attributable to the team’s own performance. Practitioners who do this consistently produce evaluations that hold up to scrutiny and that support continued investment in team training. Practitioners who do not leave their programs visible to sponsors only at Level 1, where the signal is weakest and budget conversations are easiest to lose.

References

Alliger, G. M., Tannenbaum, S. I., Bennett, W., Jr., Traver, H., & Shotland, A. (1997). A meta-analysis of the relations among training criteria. Personnel Psychology, 50(2), 341–358.

Holton, E. F., III. (1996). The flawed four-level evaluation model. Human Resource Development Quarterly, 7(1), 5–21.

Kirkpatrick, J. D., & Kirkpatrick, W. K. (2016). Kirkpatrick’s four levels of training evaluation. ATD Press.

Mathieu, J., Maynard, M. T., Rapp, T., & Gilson, L. (2008). Team effectiveness 1997–2007: A review of recent advancements and a glimpse into the future. Journal of Management, 34(3), 410–476.

Salas, E., Tannenbaum, S. I., Kraiger, K., & Smith-Jentsch, K. A. (2012). The science of training and development in organizations: What matters in practice. Psychological Science in the Public Interest, 13(2), 74–101.