Drilling Down on Level 3: How to Measure Behavior Change After Team Training
Of Kirkpatrick’s four evaluation levels, Level 3 (Behavior) is the one that matters most for team training and the one most often missing from its evaluations. The pattern is well-documented: most organizations that conduct team training evaluate it through post-training reaction surveys (Level 1) and, at best, individual knowledge assessments (Level 2). Evaluation at the level that asks whether the team is actually doing anything differently three weeks, six weeks, or three months after training is rarely attempted (Alliger et al., 1997; Salas et al., 2012).
This is not a small omission. Level 3 is where the question the sponsor cares about is actually answered: did the investment change how this team works? A program that demonstrates strong Level 1 and Level 2 results but cannot produce Level 3 evidence has not yet earned the budget that paid for it. This post lays out what Level 3 means in the team training context, why it is hard, what it requires in practice, and how practitioners can plan for it before training ever begins.
What Level 3 Means for a Team
For an individual training program, Level 3 evaluates whether the trainee is performing the targeted behavior in their job. The unit of observation is the individual at work.
For a team training program, Level 3 evaluates whether the team is performing the targeted teamwork behaviors in real work or in high-fidelity simulation of that work. The unit of observation is the team. The behaviors of interest are the ones the training was designed to develop: closed-loop communication, mutual performance monitoring, backup behavior, structured handoffs, shared situation awareness, debrief quality, and so forth, scored as observed in the team’s actual work.
The shift in unit of observation has practical consequences. An individual Level 3 evaluation can be done by a manager noting whether one person uses a new technique. A team Level 3 evaluation requires a structured observation of the team in concert: who said what to whom, what the response was, what coordination signals were present, and how the sequence compared to a defined behavioral standard. This cannot be done from a survey or from anecdote. It requires direct observation by trained observers using a defined scoring rubric.
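To make the rubric idea concrete, here is a minimal sketch of how a behavioral marker rubric can be represented. The marker names come from the list above; the definitions and anchor wordings are invented for illustration and are not drawn from any validated instrument.

```python
from dataclasses import dataclass

@dataclass
class BehavioralMarker:
    """One observable teamwork behavior on the scoring rubric."""
    name: str
    definition: str          # what the observer must actually see or hear
    anchors: dict[int, str]  # behaviorally anchored rating scale

# Hypothetical markers for illustration only; a real evaluation would
# draw these from a validated instrument (e.g., TEAM, NOTECHS, OTAS).
MARKERS = [
    BehavioralMarker(
        name="closed_loop_communication",
        definition="Sender directs message to a named receiver; receiver "
                   "reads back; sender confirms the read-back.",
        anchors={
            1: "Messages broadcast with no named receiver or read-back",
            3: "Read-backs occur but the confirmation step is inconsistent",
            5: "Full loop (directed message, read-back, confirmation) is routine",
        },
    ),
    BehavioralMarker(
        name="backup_behavior",
        definition="Team member detects an overloaded teammate and offers "
                   "or provides task assistance without being asked.",
        anchors={
            1: "Overload visible but no assistance offered",
            3: "Assistance offered only after an explicit request",
            5: "Assistance anticipates overload before performance degrades",
        },
    ),
]
```

The point of the structure is that every marker is defined in terms an observer can see or hear, which is what makes direct observation scorable rather than impressionistic.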
Why Level 3 Is Hard
Level 3 evaluation is demanding for four reasons.
Trained observers are required. An untrained observer using a behavioral marker tool produces unreliable scores. Inter-rater reliability has to be established before observation begins, typically through paired observation of pilot scenarios with calibration discussions until agreement reaches an acceptable threshold (kappa or ICC values of approximately .70 or higher are commonly cited as a working standard).
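As a sketch of that calibration check, the snippet below computes unweighted Cohen’s kappa for two observers scoring the same pilot scenario. Unweighted kappa treats the ratings as nominal; for ordinal rubric scales a weighted kappa or an ICC is usually preferred, but the gating logic is the same. The ratings are invented.

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters assigning categorical scores."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: proportion of items both raters scored identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance overlap of each rater's marginal distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Paired ratings of one pilot scenario, one score per marker (1-5 scale).
observer_1 = [3, 4, 2, 5, 3, 3, 4, 2]
observer_2 = [3, 4, 2, 4, 3, 3, 4, 3]

kappa = cohens_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.2f}")  # ~0.64 for these invented ratings
if kappa < 0.70:
    print("Below threshold: run another calibration round before observing.")
```

Note that these two observers agree on six of eight items yet still fall short of .70, which is exactly why raw percent agreement is not a substitute for a chance-corrected statistic.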
Access to team work has to be scheduled. For some teams, observation of real work is feasible (operating rooms, emergency departments, certain types of meetings). For others, real work is too sensitive, too distributed, or too intermittent to observe directly, and high-fidelity simulation has to substitute. Either path requires advance commitment from the sponsor and the team.
A baseline is required. Without a pre-training observation using the same instrument, a post-training observation cannot speak to change. Many programs skip the baseline because it adds time and cost up front, then find the post-training data they collect later impossible to interpret.
The cadence has to extend beyond the immediate post-training window. Behaviors observed the week after training reflect short-term retention more than they reflect transfer. A defensible Level 3 evaluation includes a delayed observation, typically at 90 days, to assess whether the change persisted (Salas et al., 2008).
These requirements are why Level 3 evaluations are often scoped out of programs at the budgeting stage. The practical consequence is that the program has no defensible evidence of impact when the budget conversation comes around the following year.
What a Level 3 Evaluation Looks Like in Practice
A defensible team training Level 3 evaluation has six elements.
1. A behavioral marker tool selected before training. The tool should match the team’s context, draw on validated instruments rather than ad hoc rubrics, and define each behavior in observable terms. A separate post on this site discusses tool selection (TEAM, NOTECHS, OTAS, BARS) in detail. The selection happens before training so that the training itself can be designed around the same behaviors the evaluation will score.
2. Two or more trained observers, with established inter-rater reliability. Observers complete training on the instrument, score paired pilot scenarios, and reach an agreement threshold before the evaluation begins. Where possible, dual observation continues during the actual evaluation so reliability can be checked throughout.
3. A pre-training baseline observation. The baseline establishes the team’s current performance on the targeted behaviors. It also serves a secondary purpose: it informs the training design, because gaps observed in the baseline can sharpen the curriculum’s focus.
4. An immediate post-training observation. This observation, conducted within one to two weeks of training, measures short-term retention and skill acquisition. It is the easiest measurement to collect and the least informative on its own; it is most useful when interpreted alongside the baseline and the delayed observation.
5. A delayed observation at approximately 90 days. This is the measurement that most directly answers the sponsor’s actual question. A program that produces strong immediate post-training scores but weak 90-day scores has produced learning without transfer; the curriculum may be sound but the sustain plan is failing. A program that produces strong scores at both points has produced durable behavior change. A minimal sketch of this comparison follows the list.
6. A consistent observation context across the three time points. Comparing baseline, immediate post-training, and 90-day data is only valid if the team is observed performing comparable work each time. This typically means scheduling observations during the same type of meeting, procedure, or simulated scenario. Variation in task demands across observations introduces noise that makes interpretation difficult.
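Once the three observations are scored with the same instrument, the comparison itself is straightforward arithmetic. A minimal sketch with invented mean scores on a 1-to-5 rubric scale:

```python
# Mean rubric scores per marker at each of the three time points.
# Values are invented for illustration.
scores = {
    "closed_loop_communication": {"baseline": 2.1, "immediate": 4.2, "day_90": 3.8},
    "backup_behavior":           {"baseline": 1.8, "immediate": 3.9, "day_90": 2.2},
}

for marker, s in scores.items():
    acquired  = s["immediate"] - s["baseline"]   # learning: did training work at all?
    sustained = s["day_90"] - s["baseline"]      # transfer: did the change persist?
    retention = sustained / acquired if acquired else float("nan")
    print(f"{marker}: acquired {acquired:+.1f}, sustained {sustained:+.1f} "
          f"({retention:.0%} of the initial gain retained at 90 days)")
```

In this invented example, the second marker shows the learning-without-transfer signature described in element 5: a large immediate gain that has mostly decayed by day 90, which points at the sustain plan rather than the curriculum.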
What to Do When Real Work Cannot Be Observed Directly
Some teams cannot be observed in real work for confidentiality, regulatory, or logistical reasons. Executive teams, intelligence teams, certain medical teams in protected contexts, and geographically distributed teams often fall into this category. For these teams, high-fidelity simulation is the standard substitute. The simulated scenarios are designed to match the cognitive and coordination demands of the real work, scored using the same instrument, and run at the same three time points.
Simulation-based Level 3 has a useful secondary advantage: scenarios can be standardized across observation points in a way that real work cannot, which improves the validity of the comparison. The trade-off is that performance in simulation is not identical to performance in real work, and practitioners should be transparent with sponsors about this caveat.
Common Pitfalls
Three errors recur in team training Level 3 evaluations.
Substituting self-report for observation. Asking team members whether they are now using closed-loop communication measures perception, not behavior. Self-reported behavior change is weakly correlated with observed behavior change in most studies, and it is particularly unreliable for behaviors team members have been trained to value, because social desirability inflates the responses. Self-report is appropriate at Levels 1 and 2; Level 3 requires observation.
Skipping the baseline. Without baseline data, post-training observations cannot speak to change. The most common rationalization for skipping baseline is that “we know the team is not doing this now,” which substitutes the practitioner’s intuition for evidence and produces an evaluation that cannot be defended.
Stopping at the immediate post-training observation. Immediate scores are heavily inflated by short-term recall and by the social context of being observed shortly after a training event in which the behaviors were salient. The 90-day measurement is what tells the sponsor whether the investment paid off, and it is the measurement most often dropped from the plan.
Planning Backward from Level 3
The most reliable way to ensure a team training program will be evaluated at Level 3 is to plan the evaluation before designing the training. Practitioners who treat evaluation design as a downstream task end up with curricula that have no clean Level 3 read. Practitioners who define the behavioral markers, the observation cadence, and the observers in the same project plan that scopes the curriculum produce programs that can be evaluated. The two activities are inseparable in well-designed team training.
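Concretely, planning backward can mean writing the evaluation specification before the curriculum. A hypothetical sketch follows; the instrument choice, marker names, observation context, and dates are all placeholders:

```python
from datetime import date, timedelta

training_date = date(2025, 9, 15)  # illustrative

# Evaluation spec written before curriculum design; the training is then
# built around the same markers the observers will score.
evaluation_plan = {
    "instrument": "TEAM",  # selected before training, per element 1
    "markers": ["closed_loop_communication", "backup_behavior",
                "mutual_performance_monitoring", "structured_handoffs"],
    "observers": 2,  # trained, with kappa >= .70 established before baseline
    "observation_context": "weekly operations review meeting",  # held constant
    "cadence": {
        "baseline":  training_date - timedelta(days=14),
        "immediate": training_date + timedelta(days=10),  # within 1-2 weeks
        "day_90":    training_date + timedelta(days=90),
    },
}
```

Everything the curriculum needs to target is pinned down in the plan before design begins, so the training is built to move the same markers the observers will score.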
Level 3 is hard. It is also the level that produces the evidence team training has needed for decades to defend its place in the learning and development portfolio. Practitioners who skip it inherit an evidence base that is mostly Level 1 reaction data, which is exactly the evidence base that has made team training easy to cut whenever a budget tightens. The level deserves the investment.
References
Alliger, G. M., Tannenbaum, S. I., Bennett, W., Jr., Traver, H., & Shotland, A. (1997). A meta-analysis of the relations among training criteria. Personnel Psychology, 50(2), 341–358.
Kirkpatrick, J. D., & Kirkpatrick, W. K. (2016). Kirkpatrick’s four levels of training evaluation. ATD Press.
Salas, E., Cooke, N. J., & Rosen, M. A. (2008). On teams, teamwork, and team performance: Discoveries and developments. Human Factors, 50(3), 540–547.
Salas, E., Tannenbaum, S. I., Kraiger, K., & Smith-Jentsch, K. A. (2012). The science of training and development in organizations: What matters in practice. Psychological Science in the Public Interest, 13(2), 74–101.
