Team Science

Choosing a Behavioral Marker Tool: TEAM, NOTECHS, OTAS, and BARS

A Level 3 evaluation of team training depends on a defensible scoring instrument. The instrument is what converts observation into evidence: it specifies which behaviors to score, how to score them, and what counts as good or poor performance. Without a defined instrument, observation produces opinions; with one, observation produces data that can be aggregated, compared across teams, and tracked over time.

Practitioners new to team training evaluation often build their own observation rubrics from scratch. The intent is usually reasonable (to fit the rubric to the team’s specific context), but the result is almost always weaker than what the practitioner could have obtained by selecting and lightly adapting a validated instrument. A homemade rubric has no validation evidence, no inter-rater reliability data, no comparison cohort, and no published anchor for what good performance actually looks like. The numbers it produces are difficult to defend.

Four families of behavioral marker tools have strong evidence bases and are appropriate for most team training evaluation contexts. Each was developed for a specific kind of team, but all four are adaptable to a wider range of settings than the original development context. This post defines each, identifies its best fit, and offers a short selection guide.

TEAM (Team Emergency Assessment Measure)

The Team Emergency Assessment Measure (TEAM) was developed by Cooper et al. (2010) for evaluating the non-technical performance of resuscitation teams in emergency settings. The instrument has 11 items grouped into three categories (leadership, teamwork, and task management) plus a global rating, with items scored on a five-point scale anchored by behavioral descriptors. The complete instrument runs to a single page, which is part of its appeal: it can be applied in real time during a fast-moving resuscitation without losing fidelity.
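The item-plus-category structure described above can be sketched as a simple scoring record. The item names below are illustrative placeholders, not the actual TEAM wording (which is defined in Cooper et al. 2010); only the shape of the instrument — items in three categories, each rated 0–4, plus a separate global rating — is taken from the text.

```python
# Sketch of a TEAM-style score sheet. Item names are ILLUSTRATIVE ONLY;
# the real items and their wording are published in Cooper et al. (2010).
TEAM_ITEMS = {
    "leadership": ["leader_directed_team", "leader_maintained_overview"],
    "teamwork": [
        "communicated_effectively", "worked_together", "acted_composed",
        "adapted_to_changes", "anticipated_actions", "monitored_each_other",
        "reprioritised_under_pressure",
    ],
    "task_management": ["prioritised_tasks", "followed_standards"],
}

def score_team(ratings: dict[str, int], global_rating: int) -> dict:
    """Aggregate per-item ratings (0-4 each) into category and total scores."""
    for item, value in ratings.items():
        if not 0 <= value <= 4:
            raise ValueError(f"{item}: rating {value} outside the 0-4 scale")
    category_totals = {
        category: sum(ratings[item] for item in items)
        for category, items in TEAM_ITEMS.items()
    }
    return {
        "categories": category_totals,
        "total": sum(category_totals.values()),
        "global": global_rating,
    }
```

Keeping category subtotals separate, rather than reporting only a total, preserves the diagnostic value of the three-category structure when feeding results back to the team.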

TEAM has been validated across multiple emergency contexts (in-hospital resuscitation, simulated codes, prehospital emergency response) and in several languages. Inter-rater reliability is well documented, with most studies reporting acceptable agreement when raters complete a short training protocol. The instrument’s brevity makes it the leading choice for teams whose work happens in short, high-tempo episodes.

Best fit: resuscitation teams, emergency response teams, rapid response teams, code teams, trauma teams, and other teams whose work is concentrated in brief high-stakes episodes.

NOTECHS (Non-Technical Skills)

NOTECHS was originally developed by Flin et al. (2003) for evaluating the non-technical skills of European airline crews. The instrument scores four behavioral categories (cooperation, leadership and managerial skills, situation awareness, and decision-making) against defined behavioral indicators. NOTECHS has since been adapted for multiple high-reliability domains, most notably as ANTS (Anaesthetists’ Non-Technical Skills) and NOTSS (Non-Technical Skills for Surgeons), and remains influential in the broader behavioral marker literature.

The instrument’s strength is its grounding in cognitive task analysis of high-reliability work and its emphasis on the cognitive (situation awareness, decision-making) as well as the social (leadership, cooperation) dimensions of team performance. For practitioners working with teams whose breakdowns center on judgment and shared awareness as much as on communication, NOTECHS and its adaptations are the natural choice.

Best fit: aviation crews, anaesthesia teams, surgical teams (via NOTSS), and other high-reliability teams in domains where the original or adapted version has been validated. The framework also generalizes well to any high-stakes coordination context where the four NOTECHS categories map cleanly onto the team’s work.

OTAS (Observational Teamwork Assessment for Surgery)

The Observational Teamwork Assessment for Surgery (OTAS) was developed by Healey et al. (2004) to evaluate teamwork in the operating room. OTAS is distinctive in two ways: it scores teamwork behaviors separately for each phase of the operation (pre-operative, intra-operative, post-operative), and it can be applied at the level of the surgical sub-team (surgeons, anaesthetists, nurses) as well as at the level of the operating room team as a whole.

The phase structure is what most differentiates OTAS from other instruments. Surgical teamwork demands change substantially across the operation; the behaviors that distinguish a high-performing pre-operative briefing differ from the behaviors that distinguish a high-performing wound closure. An instrument that scores the entire procedure with one rubric loses this granularity. OTAS preserves it, which makes the resulting feedback far more actionable.
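The phase-keyed granularity described above can be sketched as a small observation log. The phase names follow the text; the sub-team and behavior labels are hypothetical, and the real OTAS behaviors and scoring conventions are defined in Healey et al. (2004).

```python
# Sketch of phase-keyed teamwork observation in the spirit of OTAS.
# Behaviour and sub-team labels are HYPOTHETICAL illustrations.
PHASES = ("pre_operative", "intra_operative", "post_operative")

def record_observation(log, phase, sub_team, behaviour, rating):
    """Append one rated behaviour, keyed by operative phase and sub-team."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    log.append({"phase": phase, "sub_team": sub_team,
                "behaviour": behaviour, "rating": rating})

def phase_summary(log):
    """Mean rating per phase, preserving the granularity a single
    whole-procedure rubric would lose."""
    by_phase = {}
    for obs in log:
        by_phase.setdefault(obs["phase"], []).append(obs["rating"])
    return {phase: sum(r) / len(r) for phase, r in by_phase.items()}
```

Because each observation carries both a phase and a sub-team key, the same log can be summarized either way — per phase for the whole operating room team, or per sub-team within a phase.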

Best fit: operating room teams, including surgical, anaesthesia, and nursing sub-teams, and contexts where work is naturally divided into phases with distinct teamwork demands.

BARS (Behaviorally Anchored Rating Scales)

Behaviorally Anchored Rating Scales (BARS), originally proposed by Smith and Kendall (1963), is a general-purpose method rather than a specific instrument. A BARS scoring scale defines each rating point with a concrete behavioral example drawn from the work being evaluated. Instead of asking an observer to rate “communication” on a five-point scale where 5 is “excellent,” a BARS scale anchors 5 with a specific behavior (“team consistently uses closed-loop confirmation on every clinical instruction”) and 1 with a specific contrasting behavior (“team frequently issues instructions without confirmation; recipient action is unclear”).
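A BARS scale is ultimately just a mapping from rating points to concrete behavioral anchors. The sketch below reuses the two communication anchors quoted in the text; the intermediate anchors (2 through 4) are invented here purely to illustrate the form, and in practice would come from critical incident analysis.

```python
# A minimal BARS scale as a data structure. Anchors for 5 and 1 are the
# examples from the text; anchors 2-4 are ILLUSTRATIVE inventions.
closed_loop_communication = {
    5: "Team consistently uses closed-loop confirmation on every "
       "clinical instruction.",
    4: "Closed-loop confirmation is used on most instructions; "
       "occasional lapses are caught and corrected by the team.",
    3: "Confirmation is used inconsistently; some instructions go "
       "unacknowledged.",
    2: "Confirmation is rare; most instructions are issued without "
       "read-back.",
    1: "Team frequently issues instructions without confirmation; "
       "recipient action is unclear.",
}

def anchor_for(scale: dict[int, str], rating: int) -> str:
    """Return the behavioural anchor an observer compares against
    before committing to a rating."""
    if rating not in scale:
        raise ValueError(f"rating {rating} has no anchor on this scale")
    return scale[rating]
```

The point of the structure is that every rating an observer assigns is justified by pointing at a specific anchor, which is what makes the resulting scores defensible.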

The strength of BARS is adaptability. When none of the published instruments fits a team’s context cleanly (for example, a non-clinical, non-aviation, non-surgical team in a domain without an established marker tool), a BARS instrument can be developed from a focused critical incident analysis of the team’s work. The development process requires real effort (typically interviews with experienced operators to surface concrete examples of good and poor performance, followed by iterative refinement of the anchors), but the resulting instrument is grounded in the work it evaluates and produces interpretable scores.

Best fit: any team for which no validated published instrument exists, including executive teams, professional service teams, intelligence analysis teams, and corporate cross-functional teams. BARS is also useful as a complement to a published instrument when one or two team-specific behaviors need to be scored alongside the standard set.

A Short Selection Guide

Selecting among the four can be reduced to three questions.

What is the team’s domain? If the team operates in a domain where one of the published instruments was developed (resuscitation, aviation, anaesthesia, surgery), the published instrument is almost always the right choice, because the validation work has already been done. Adapting an instrument from a different domain is harder than it sounds and typically reduces psychometric quality.

What is the structure of the team’s work? Teams whose work happens in short high-stakes episodes are well-served by TEAM. Teams whose work is divided into distinct phases benefit from OTAS or an OTAS-style phase structure. Teams whose breakdowns are primarily cognitive (judgment, situation awareness, decision-making) benefit from NOTECHS or its adaptations.

Does a published instrument exist for this team? If yes, use it and adapt minimally. If no, develop a BARS instrument grounded in critical incident analysis. Resist the temptation to write a generic teamwork rubric without anchoring; the result is unreliable scoring.
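The three questions above can be encoded as a rough decision sketch. The domain-to-instrument mapping is a simplification of this guide, not a substitute for judgment about whether a published instrument genuinely fits.

```python
# The selection guide as a sketch. The mapping is a simplification of
# the three questions above, not an authoritative decision procedure.
def select_instrument(domain: str, work_structure: str) -> str:
    validated = {                      # Q1: does a published tool fit the domain?
        "resuscitation": "TEAM",
        "emergency": "TEAM",
        "aviation": "NOTECHS",
        "anaesthesia": "ANTS (NOTECHS adaptation)",
        "surgery": "OTAS or NOTSS",
    }
    if domain in validated:
        return validated[domain]
    if work_structure == "episodic":   # Q2: short high-stakes episodes
        return "TEAM (minimally adapted)"
    if work_structure == "phased":     # Q2: distinct phases of work
        return "OTAS-style phase structure"
    # Q3: no published instrument exists for this team
    return "BARS from critical incident analysis"
```

Note that the fallback branch is BARS, mirroring the guide: when no validated instrument exists, anchoring a custom scale in critical incidents beats writing a generic rubric.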

Pilot Before You Evaluate

Whichever instrument is selected, it should be piloted before being used for a formal evaluation. A pilot run on an existing team or simulation accomplishes three things: it surfaces ambiguities in how items are interpreted, it gives observers practice and lets the practitioner check inter-rater reliability before the stakes are higher, and it identifies any context-specific behaviors that should be added through a small BARS extension. A weekend of piloting prevents most of the data-quality problems that practitioners later discover when they try to interpret the formal evaluation results.
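The inter-rater reliability check during the pilot can be done with nothing more than the two raters' scores on the same episodes. A minimal, dependency-free sketch: percent agreement plus Cohen's kappa, which corrects raw agreement for the agreement expected by chance (statistics packages offer equivalents).

```python
# Inter-rater reliability for a pilot run: two raters score the same
# items, and kappa discounts the agreement expected by chance alone.
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave identical scores."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - chance) / (1 - chance) agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal score distribution.
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)
```

A kappa well below the raw percent agreement is itself a useful pilot finding: it usually means the raters are agreeing mostly where the scale's midpoint makes agreement easy, and the item anchors need sharpening.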

A Note on Custom Tools

The most common alternative to selecting a validated instrument is to build a custom one from scratch. The reasons for doing this are usually superficial (the published instruments “do not quite fit,” the team has unique work, the sponsor wants something proprietary) and the cost is almost always underestimated. Building a defensible instrument requires content validation, inter-rater reliability data, criterion-related evidence, and ideally cross-team comparison data. None of these are produced by the typical custom rubric. The disciplined move is to start from a validated instrument and adapt minimally; the work it saves and the credibility it preserves are usually worth far more than the proprietary feel of a custom rubric.

References

Cooper, S., Cant, R., Porter, J., Sellick, K., Somers, G., Kinsman, L., & Nestel, D. (2010). Rating medical emergency teamwork performance: Development of the Team Emergency Assessment Measure (TEAM). Resuscitation, 81(4), 446–452.

Flin, R., Martin, L., Goeters, K. M., Hörmann, H. J., Amalberti, R., Valot, C., & Nijhuis, H. (2003). Development of the NOTECHS (non-technical skills) system for assessing pilots’ CRM skills. Human Factors and Aerospace Safety, 3(2), 97–119.

Healey, A. N., Undre, S., & Vincent, C. A. (2004). Developing observational measures of performance in surgical teams. Quality and Safety in Health Care, 13(Suppl 1), i33–i40.

Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47(2), 149–155.