Kappa Coefficient
- Quantifies how much two or more raters agree when evaluating the same items.
- Compares observed agreement to agreement expected by chance.
- Commonly used to assess reliability in fields like psychology and sociology.
Definition
The Kappa coefficient is a statistical measure of inter-rater reliability representing the degree of agreement between two or more raters who are evaluating the same items.
Explanation
- The Kappa coefficient quantifies agreement by comparing observed agreement (the actual matching ratings or judgments) to expected agreement (the agreement that would occur if raters made ratings randomly).
- Ratings or judgments are first coded into categories (the source describes binary categories as an example, e.g., “improved” vs “not improved” or “positive” vs “negative”).
- Expected agreement is calculated by assuming raters make ratings randomly without real knowledge of the items; observed agreement is calculated from the actual rater data.
- The Kappa coefficient is calculated by subtracting the expected agreement from the observed agreement and dividing by one minus the expected agreement: Kappa = (observed agreement - expected agreement) / (1 - expected agreement). A worked sketch follows this list.
- A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance (negative values are possible when raters agree less often than chance would predict). Per the source, a Kappa coefficient of 0.6 or higher generally indicates a high level of agreement, while 0.4 or lower indicates a low level of agreement.
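A minimal sketch of this calculation for two raters, written in Python; the function name, the helper logic, and the toy “improved”/“not improved” ratings are illustrative assumptions, not part of the source:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Kappa for two raters judging the same items (illustrative sketch)."""
    n = len(rater_a)

    # Observed agreement: fraction of items where the two raters match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement: probability of a chance match, assuming each rater
    # assigns categories randomly with their own observed category frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

    # Kappa: how far observed agreement exceeds chance, scaled by the
    # maximum possible improvement over chance.
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical binary ratings from two raters over six items.
rater_1 = ["improved", "improved", "not improved", "improved", "not improved", "improved"]
rater_2 = ["improved", "not improved", "not improved", "improved", "not improved", "improved"]
print(round(cohen_kappa(rater_1, rater_2), 3))
```

For this toy data, observed agreement is 5/6 (about 0.83) and expected agreement is 0.5, giving a Kappa of about 0.67.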
Examples
Psychology (therapy effectiveness)
Researchers evaluating a new therapy for depression gather ratings from multiple trained raters who assess participants’ symptoms before and after therapy. Raters’ assessments are coded (e.g., “improved” or “not improved”) and compared using the Kappa coefficient to determine the degree of agreement and assess data reliability.
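In practice, a library routine can perform this comparison. The sketch below assumes scikit-learn is installed and uses hypothetical ratings from two of the trained raters; `cohen_kappa_score` compares exactly two raters at a time:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical post-therapy assessments from two trained raters,
# coded as "improved" vs "not improved" for the same participants.
rater_1 = ["improved", "improved", "not improved", "improved", "not improved"]
rater_2 = ["improved", "not improved", "not improved", "improved", "not improved"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Kappa: {kappa:.2f}")  # values near or above 0.6 suggest high agreement
```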
Sociology (social dynamics)
Researchers observing interactions within a community use multiple trained raters to judge interactions (e.g., coded as “positive” or “negative”). The Kappa coefficient is used to determine the degree of agreement between raters’ judgments and to evaluate the reliability of the collected data.
Use cases
- Assessing reliability of subjective judgments in psychology.
- Evaluating agreement among observers in sociology.
- Any discipline where multiple raters make categorical judgments and reliability of those ratings must be quantified.
Notes or pitfalls
- The source describes coding ratings into binary categories as an example; proper coding of categories is required before calculating Kappa.
- Expected agreement is based on the assumption that raters are rating randomly; this assumption underlies the comparison to observed agreement.
- The thresholds mentioned in the source: 0.6 or higher generally indicates high agreement; 0.4 or lower generally indicates low agreement.
Related terms
- Inter-rater reliability
- Observed agreement
- Expected agreement
- Binary categories