Adventures in Machine Learning

Measuring Inter-Rater Agreement: Understanding Cohen’s and Fleiss’ Kappa

Cohen’s Kappa: Understanding Inter-Rater Agreement

Are you considering designing a study or conducting research that involves human raters or judges? If so, you will need to consider how to measure the agreement between raters or judges.

Cohen’s Kappa (κ) is a statistical measure that can help with this task. In this article, we will explore Cohen’s Kappa and how it can be calculated and interpreted.

1. Cohen’s Kappa: Definition and Formula

Cohen’s Kappa is a statistic that measures inter-rater agreement for categorical variables.

It measures how much agreement two raters or judges have when categorizing the same objects or events into mutually exclusive categories. The Kappa statistic takes into account the agreement that might have occurred by chance and reports the agreement that remains after accounting for it.

The formula for Cohen’s Kappa is as follows:

Kappa = (Po – Pe) / (1 – Pe)


  • Po = Observed agreement between raters or judges
  • Pe = Agreement expected by chance

The value of Kappa ranges from -1 to 1. A score of 1 indicates perfect agreement, 0 indicates agreement by chance, and -1 indicates perfect disagreement.
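As a quick illustration of the formula, the following minimal sketch computes Kappa directly from Po and Pe. The function name kappa_from_agreement and the input values are hypothetical, chosen only to show the three reference points mentioned above.

```python
def kappa_from_agreement(po: float, pe: float) -> float:
    """Return (po - pe) / (1 - pe), the chance-corrected agreement."""
    if pe >= 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (po - pe) / (1.0 - pe)


# Hypothetical inputs with chance agreement fixed at 0.5.
perfect = kappa_from_agreement(po=1.0, pe=0.5)  # raters always agree
chance = kappa_from_agreement(po=0.5, pe=0.5)   # agreement equals chance
below = kappa_from_agreement(po=0.0, pe=0.5)    # raters never agree

print(perfect, chance, below)  # 1.0 0.0 -1.0
```

Note that the lower bound of -1 is reached here only because Pe is 0.5. For other values of Pe, the smallest possible Kappa is -Pe / (1 - Pe).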

The value of Kappa is influenced by the number of categories and by how the ratings are distributed across them, since both affect the agreement expected by chance.

2. Interpretation and Range of Values

The interpretation of the Kappa statistic depends on the value obtained. The following guidelines, proposed by Landis and Koch (1977), are widely used as a rough guide:

  • A score of less than 0 is considered as indicating poor agreement.
  • A score between 0 and 0.20 indicates slight agreement.
  • A score between 0.21 and 0.40 indicates fair agreement.
  • A score between 0.41 and 0.60 indicates moderate agreement.
  • A score between 0.61 and 0.80 indicates substantial agreement.
  • A score between 0.81 and 1.00 indicates almost perfect agreement.
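The guidelines above can be sketched as a small lookup function. This is only an illustration: interpret_kappa is a hypothetical helper, and treating each band's upper limit as inclusive is an assumption made here to cover values that fall between the listed bands (such as 0.205).

```python
def interpret_kappa(kappa: float) -> str:
    """Map a Kappa value to the rough agreement label listed above."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


for value in (-0.1, 0.15, 0.35, 0.55, 0.75, 0.9):
    print(value, interpret_kappa(value))
```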

It’s important to note that the interpretation of Kappa depends on the field and the context in which it is used.

Researchers should use their expertise to determine whether an observed Kappa value represents an acceptable level of agreement.

3. Importance of Accounting for Chance Agreement

The Kappa statistic is designed to account for the agreement that might occur just by chance. This is important since most studies involve a degree of chance agreement.

For example, two raters or judges working independently on the same data set, without influencing each other, might still agree on some items purely by chance. By accounting for chance agreement, the Kappa statistic measures the agreement that goes beyond chance.
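To get a feel for how large chance agreement can be, the sketch below computes the probability that two independent raters pick the same category. The helper name chance_agreement and the 90/10 split are hypothetical values chosen for illustration.

```python
def chance_agreement(rater_a_props, rater_b_props):
    """Probability that two independent raters choose the same category.

    Each argument lists the proportion of items a rater assigns to each
    category, in the same category order.
    """
    return sum(a * b for a, b in zip(rater_a_props, rater_b_props))


# Hypothetical case: both raters label 90% of items "negative"
# and 10% "positive", without looking at the items at all.
pe = chance_agreement([0.9, 0.1], [0.9, 0.1])
print(round(pe, 2))  # 0.82
```

Under these assumptions the raters would agree on about 82% of items by chance alone, which is why raw percent agreement can be misleading when one category dominates.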

4. Calculation of Cohen’s Kappa

Interpreting the results of the Kappa statistic is relatively simple, but calculating it can be a bit more involved, especially for large datasets.

Here we work through an example of how to calculate Cohen’s Kappa step by step:

Suppose we have two raters who independently judged 240 participants across three categories (low, medium, and high risk). We can represent their judgments using a confusion matrix, with the first rater’s categories as rows and the second rater’s categories as columns.

[[60, 10, 30],
 [20, 50, 20],
 [10, 30, 10]]

The first step is to calculate Po, the observed agreement between raters, by adding up the diagonal cells and dividing by the total number of judgments.

Po = (60 + 50 + 10) / 240 = 0.5

Next, we need to calculate Pe, the expected agreement based on chance.

To do so, for each category we multiply the proportion of judgments the first rater assigned to it (row total / 240) by the proportion the second rater assigned to it (column total / 240). We then add these products together for all categories. The row totals are 100, 90, and 50, and the column totals are 90, 90, and 60.

Pe = (100/240 * 90/240) + (90/240 * 90/240) + (50/240 * 60/240)

Pe ≈ 0.349

Finally, we plug our values into the Kappa formula and get the following:

Kappa = (Po – Pe) / (1 – Pe) = (0.5 – 0.349) / (1 – 0.349) ≈ 0.232

Our Kappa score of 0.232 indicates that there is fair agreement between our two raters.

5. Interpreting Cohen’s Kappa Scores

Interpreting the Kappa statistic involves comparing the obtained score against the guidelines mentioned earlier. From our example, we can conclude that the two raters have fair agreement, which means that while their judgments agreed more often than chance would predict, there is still considerable room for improvement.
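The calculation from section 4 can also be sketched in plain Python, working from the confusion matrix shown there. The function name cohen_kappa is a hypothetical helper, and the total number of judgments is taken from the matrix itself (its cells sum to 240).

```python
def cohen_kappa(matrix):
    """Return (po, pe, kappa) for a square confusion matrix of two raters."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)

    # Observed agreement: share of judgments on the diagonal.
    po = sum(matrix[i][i] for i in range(size)) / total

    # Chance agreement: product of the two raters' marginal proportions,
    # summed over categories.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(size)]
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2

    return po, pe, (po - pe) / (1 - pe)


confusion = [
    [60, 10, 30],
    [20, 50, 20],
    [10, 30, 10],
]
po, pe, kappa = cohen_kappa(confusion)
print(f"Po={po:.3f} Pe={pe:.3f} Kappa={kappa:.3f}")
# Po=0.500 Pe=0.349 Kappa=0.232
```

If scikit-learn is available, its cohen_kappa_score function computes the same statistic from two sequences of labels rather than from a confusion matrix.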

In conclusion, Cohen’s Kappa is a widely used statistical measure that can help researchers determine the degree of agreement between raters or judges in studies that involve categorizing data. By accounting for the possibility of chance agreement, the Kappa statistic provides a more accurate measure of the actual degree of agreement.

While calculating the Kappa statistic requires some effort, it ultimately provides valuable information that can aid in making important data-driven decisions.

Fleiss’ Kappa: Understanding Inter-Rater Agreement Between Multiple Judges

While Cohen’s Kappa provides a measure of inter-rater agreement for two judges, it falls short when used with a team of many judges.

To account for inter-rater agreement when there are three or more raters or judges, Fleiss’ Kappa is the appropriate statistical measure. In this article, we’ll explore Fleiss’ Kappa in detail, including its definition and formula, comparisons with Cohen’s Kappa, and how to calculate it using the Statsmodels library.

1. Fleiss’ Kappa: Definition and Formula

Fleiss’ Kappa measures the level of agreement among three or more raters or judges by comparing the observed agreement to the agreement that would be expected purely by chance.

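Its maximum value is 1, which indicates perfect agreement. Values at or below 0 indicate no agreement beyond what chance would produce. The formula for Fleiss’ Kappa has the same form as the formula for Cohen’s Kappa:

Kappa = (Po – Pe) / (1 – Pe)


  • Po = Observed agreement: for each participant, the proportion of rater pairs that chose the same category, averaged over all participants
  • Pe = Agreement expected by chance: the sum, over categories, of the squared proportion of all ratings that fall into that category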

Working with proportions rather than raw counts keeps the statistic comparable across studies with different numbers of raters, participants, and categories.
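A from-scratch sketch of the calculation may make the two ingredients more concrete. It assumes the standard Fleiss definitions: observed agreement is the average share of agreeing rater pairs per participant, and chance agreement is the sum of squared overall category proportions. The function name and both small tables are hypothetical.

```python
def fleiss_kappa_from_counts(table):
    """Fleiss' Kappa from a table where table[i][j] is the number of
    raters who assigned participant i to category j."""
    n_subjects = len(table)
    n_categories = len(table[0])
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every participant needs the same number of ratings")

    # Per-participant agreement: share of rater pairs in the same category.
    per_subject = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in table
    ]
    observed = sum(per_subject) / n_subjects

    # Chance agreement: squared overall category proportions, summed.
    proportions = [
        sum(row[j] for row in table) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)


# Hypothetical tables: 4 participants, 3 raters, 2 categories.
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]

kappa_unanimous = fleiss_kappa_from_counts(unanimous)
kappa_split = fleiss_kappa_from_counts(split)
print(kappa_unanimous, round(kappa_split, 3))  # 1.0 -0.333
```

The second table shows that the statistic can drop below zero when raters agree less often than chance would predict.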

2. Comparison with Cohen’s Kappa

While Cohen’s Kappa measures inter-rater agreement between exactly two judges, Fleiss’ Kappa extends to three or more. Both statistics handle any number of mutually exclusive categories.

The two also differ in what they assume about the judges. Cohen’s Kappa requires the same two judges to rate every participant. Fleiss’ Kappa requires each participant to receive the same number of ratings, but those ratings do not have to come from the same individuals each time.

The statistics also estimate chance agreement differently. Cohen’s Kappa uses each judge’s own distribution of ratings, while Fleiss’ Kappa uses category proportions pooled across all judges. As a result, the two need not coincide even with two judges: on the same data, Cohen’s Kappa is never smaller than the two-judge Fleiss value, and the gap grows as the judges’ rating distributions diverge.
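To see how the two statistics can differ for the same pair of raters, the sketch below computes both from one confusion matrix. It assumes the standard definitions: Cohen’s chance term multiplies each rater’s own marginal proportions, while the two-rater Fleiss chance term squares the proportions pooled across both raters. The function name is hypothetical, and the matrix is the one from the Cohen’s Kappa example earlier in the article.

```python
def two_rater_kappas(matrix):
    """Return (cohen, pooled) Kappa values for a two-rater confusion matrix."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    po = sum(matrix[i][i] for i in range(size)) / total

    # Marginal proportions for the row rater and the column rater.
    rows = [sum(row) / total for row in matrix]
    cols = [sum(row[j] for row in matrix) / total for j in range(size)]

    # Cohen: each rater keeps their own distribution.
    pe_cohen = sum(r * c for r, c in zip(rows, cols))
    # Two-rater Fleiss: both raters share one pooled distribution.
    pe_pooled = sum(((r + c) / 2) ** 2 for r, c in zip(rows, cols))

    cohen = (po - pe_cohen) / (1 - pe_cohen)
    pooled = (po - pe_pooled) / (1 - pe_pooled)
    return cohen, pooled


confusion = [
    [60, 10, 30],
    [20, 50, 20],
    [10, 30, 10],
]
cohen, pooled = two_rater_kappas(confusion)
print(f"Cohen={cohen:.3f} pooled={pooled:.3f}")
```

For this matrix the two values differ only in the third decimal place, because the two raters’ marginal distributions are similar.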

3. Calculation with Statsmodels Library

Fortunately, Statsmodels provides a built-in function, fleiss_kappa, that calculates Fleiss’ Kappa (sometimes referred to as multi-rater Kappa) automatically.

Here’s how to use it:

First, import NumPy and the two helper functions from Statsmodels:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

Next, create an array that holds the rating each judge gave each participant. The array must be 2-dimensional, with each row representing a participant and each column representing a judge.

For example:

# Rows are participants, columns are judges; each entry is a category label.
ratings = np.array([
    [0, 0, 1, 2, 1],
    [1, 1, 2, 2, 1],
    [0, 1, 1, 0, 0],
    [2, 0, 1, 1, 2],
    [2, 2, 0, 0, 1],
])

Here, we have five participants and five judges, each assigning the participants to one of three categories (0, 1, or 2). Note that fleiss_kappa does not take this array directly. It expects a table of counts with one row per participant and one column per category, and the aggregate_raters helper performs that conversion:

# Convert raw labels to per-participant category counts, then compute Kappa.
table, categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method='fleiss')
print(kappa)

Output (rounded): -0.0922

Here, our Fleiss’ Kappa value is about -0.092. A value at or below zero means the judges agreed no more often than chance would predict. That is unsurprising for this illustrative array, in which no participant received the same label from more than three of the five judges.

Also, remember that Fleiss’ Kappa treats the categories as nominal and is sensitive to how the ratings are distributed across them. When one category receives most of the ratings, the agreement expected by chance is high, and Kappa can be low even if the judges agree on most participants.
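Statsmodels’ fleiss_kappa works on a table of per-category counts rather than on raw labels, and the library offers aggregate_raters for that reshaping. As a dependency-free sketch of what the step does for the ratings array above, the conversion can be written in plain Python (the variable names here are illustrative):

```python
# Raw ratings: one row per participant, one column per judge.
ratings = [
    [0, 0, 1, 2, 1],
    [1, 1, 2, 2, 1],
    [0, 1, 1, 0, 0],
    [2, 0, 1, 1, 2],
    [2, 2, 0, 0, 1],
]

# Collect the category labels that actually occur, in sorted order.
categories = sorted({label for row in ratings for label in row})

# Count, for each participant, how many judges chose each category.
table = [[row.count(category) for category in categories] for row in ratings]

for counts in table:
    print(counts)
# [2, 2, 1]
# [0, 3, 2]
# [3, 2, 0]
# [1, 2, 2]
# [2, 1, 2]
```

Each row of the resulting table sums to five, the number of judges, which is the layout the Fleiss formula assumes.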

In conclusion, Fleiss’ Kappa is a valuable statistical technique that enables researchers to measure inter-rater agreement when working with teams of multiple judges and several categories.

It focuses on the extent of agreement beyond chance agreement and can provide useful insights for a wide range of studies and projects. While calculating Fleiss’ Kappa involves more computation than Cohen’s Kappa, built-in libraries such as Statsmodels make it relatively easy to employ in practice.

In summary, inter-rater agreement is crucial when conducting research projects involving multiple judges or raters classifying data into categories. Cohen’s Kappa is a widely used statistic for measuring inter-rater agreement for two raters or judges, while Fleiss’ Kappa is more applicable for three or more raters.

Both Cohen’s Kappa and Fleiss’ Kappa take chance agreement into account and yield a score that indicates the level of agreement between raters or judges.

By measuring inter-rater agreement with Cohen’s Kappa or Fleiss’ Kappa, researchers can assess how reliable their ratings are before drawing conclusions from them.
