Adventures in Machine Learning

Measuring Inter-Rater Reliability with ICC in Python

How to Calculate ICC in Python and Interpret the Results

Do you ever wonder how reliable a group of judges or raters are in their assessments? Do you want to quantify the degree of consistency among them?

Look no further than the Intraclass Correlation Coefficient (ICC), a widely used statistic for measuring inter-rater reliability. In this article, we’ll guide you through the steps of calculating ICC in Python and interpreting the results.

Installing Pingouin Package

Before we dive into the details of calculating ICC, we need to install a Python package called Pingouin. Pingouin is an open-source statistical package that provides a plethora of functions for various statistical analyses, including ICC.

To install Pingouin, simply open your terminal or command prompt and enter the following command:

pip install pingouin

Once you have successfully installed Pingouin, you are ready to create your data and calculate ICC.
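If you want to confirm from within Python that the install succeeded, a minimal sketch like the following can help. The helper name `is_installed` is made up for illustration; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when the package cannot be located
    return importlib.util.find_spec(package_name) is not None


print("pingouin installed:", is_installed("pingouin"))
```

If this prints False, the package was probably installed into a different Python environment than the one you are running.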

Creating the Data

Assume that you have a group of five judges who rate the same set of ten targets on a scale of 1 to 10. To create a data frame in Python that reflects this scenario, you can use the following code:

import pandas as pd

judges = ['Judge 1', 'Judge 2', 'Judge 3', 'Judge 4', 'Judge 5']

targets = ['Target 1', 'Target 2', 'Target 3', 'Target 4', 'Target 5', 'Target 6', 'Target 7', 'Target 8', 'Target 9', 'Target 10']

# One inner list per target, holding the five judges' ratings in order
ratings = [[7, 8, 7, 9, 8], [5, 4, 5, 4, 6], [9, 10, 8, 9, 9], [6, 7, 6, 8, 7], [8, 8, 9, 8, 9], [7, 7, 8, 7, 7], [4, 5, 4, 4, 5], [6, 7, 8, 6, 7], [9, 10, 9, 10, 10], [5, 6, 5, 7, 6]]

df = pd.DataFrame(ratings, columns=judges, index=targets)
print(df)


Printing df should show the following, with one row per target and one column per judge:

           Judge 1  Judge 2  Judge 3  Judge 4  Judge 5
Target 1         7        8        7        9        8
Target 2         5        4        5        4        6
Target 3         9       10        8        9        9
Target 4         6        7        6        8        7
Target 5         8        8        9        8        9
Target 6         7        7        8        7        7
Target 7         4        5        4        4        5
Target 8         6        7        8        6        7
Target 9         9       10        9       10       10
Target 10        5        6        5        7        6
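Before computing any statistic, it can help to eyeball how closely the judges agree. The sketch below rebuilds the same DataFrame and summarizes each target's mean rating and spread (highest minus lowest rating), plus each judge's overall mean. The names `summary` and `judge_means` are illustrative and not part of the original example.

```python
import pandas as pd

judges = [f"Judge {i}" for i in range(1, 6)]
targets = [f"Target {i}" for i in range(1, 11)]
ratings = [[7, 8, 7, 9, 8], [5, 4, 5, 4, 6], [9, 10, 8, 9, 9],
           [6, 7, 6, 8, 7], [8, 8, 9, 8, 9], [7, 7, 8, 7, 7],
           [4, 5, 4, 4, 5], [6, 7, 8, 6, 7], [9, 10, 9, 10, 10],
           [5, 6, 5, 7, 6]]
df = pd.DataFrame(ratings, columns=judges, index=targets)

# Per-target summary: mean rating and spread (max - min) across the judges
summary = pd.DataFrame({
    "mean": df.mean(axis=1),
    "spread": df.max(axis=1) - df.min(axis=1),
})
print(summary)

# Per-judge mean: shows whether some judges rate systematically higher
judge_means = df.mean(axis=0)
print(judge_means)
```

Small spreads relative to the differences between target means are an informal sign that the judges rank the targets similarly.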

Calculating the ICC

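Now that we have our data in a DataFrame, we can use the Pingouin package to calculate the ICC. The function we’ll use is called “intraclass_corr()”. It takes a long-format DataFrame (one row per rating) plus the names of the targets, raters, and ratings columns. It has no argument for choosing an ICC type; instead, it returns a table with one row per type.

There are three common single-rater ICC types: ICC1, ICC2, and ICC3. ICC1 applies when each target is rated by a different, randomly chosen set of raters (a one-way random-effects model). ICC2 applies when the same raters, treated as a random sample from a larger pool of raters, rate every target; it measures absolute agreement. ICC3 applies when the same fixed set of raters rates every target and they are the only raters of interest; it measures consistency, so a judge who scores uniformly higher than the others is not penalized.

For this example, we’ll use ICC3. Because our DataFrame is in wide format (one column per judge), we first reshape it to long format and then select the ICC3 row from the result:

from pingouin import intraclass_corr

# Reshape wide (one column per judge) to long (one row per rating)
df_long = df.rename_axis('Targets').reset_index().melt(id_vars='Targets', var_name='Raters', value_name='Ratings')

# intraclass_corr() returns every ICC type; keep only the ICC3 row
icc = intraclass_corr(data=df_long, targets='Targets', raters='Raters', ratings='Ratings')
print(icc[icc['Type'] == 'ICC3'])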


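For this data, the ICC3 row of the result should contain approximately the following values (the function also reports a description and a 95% confidence interval, omitted here):

Type    ICC      F  df1  df2
ICC3  0.862  32.14    9   36

The p-value of the F test is far below 0.001.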
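As a cross-check, ICC3 can be computed by hand from the two-way ANOVA mean squares. The sketch below assumes that ICC3 corresponds to the Shrout and Fleiss ICC(3,1) form, (MS_targets - MS_error) / (MS_targets + (k - 1) * MS_error), and uses NumPy on the same ratings. The variable names are illustrative.

```python
import numpy as np

# Same ratings as in the example: rows are targets, columns are judges
x = np.array([[7, 8, 7, 9, 8], [5, 4, 5, 4, 6], [9, 10, 8, 9, 9],
              [6, 7, 6, 8, 7], [8, 8, 9, 8, 9], [7, 7, 8, 7, 7],
              [4, 5, 4, 4, 5], [6, 7, 8, 6, 7], [9, 10, 9, 10, 10],
              [5, 6, 5, 7, 6]], dtype=float)
n, k = x.shape  # n targets, k raters

grand_mean = x.mean()

# Two-way ANOVA sums of squares (targets, raters, residual error)
ss_targets = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()
ss_raters = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()
ss_total = ((x - grand_mean) ** 2).sum()
ss_error = ss_total - ss_targets - ss_raters

# Mean squares: sums of squares divided by their degrees of freedom
ms_targets = ss_targets / (n - 1)
ms_error = ss_error / ((n - 1) * (k - 1))

# Single-rater consistency ICC and the accompanying F statistic
icc3 = (ms_targets - ms_error) / (ms_targets + (k - 1) * ms_error)
f_stat = ms_targets / ms_error
print(f"ICC3 = {icc3:.4f}, F({n - 1}, {(n - 1) * (k - 1)}) = {f_stat:.2f}")
```

The F statistic is the ratio MS_targets / MS_error with n - 1 and (n - 1)(k - 1) degrees of freedom, which is 9 and 36 for ten targets and five judges.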

Interpreting the Results

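Now that we have computed the ICC, let’s talk about how to interpret the results. The ICC generally ranges from 0 to 1, where 0 represents no reliability and 1 represents perfect reliability. Sample estimates can occasionally fall below 0, which happens when ratings of the same target vary more than ratings of different targets.

A common rule of thumb treats ICC values of 0.70 or above as acceptable for research purposes, although the right cutoff depends on the field and on what the ratings will be used for. In this example, the ICC3 value is well above that threshold, which indicates a high degree of consistency among the raters.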

It’s worth noting that there are variations of the ICC beyond the ICC1, ICC2, and ICC3 examples used in this article. Specifically, each of the three has an average-rater counterpart (ICC1k, ICC2k, and ICC3k) that describes the reliability of the mean of all k raters’ scores rather than the score of a single rater. These are useful when the averaged rating is what you will actually report or analyze.

To explore these variations further, readers are encouraged to consult additional resources.
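The k variants (ICC1k, ICC2k, and ICC3k) give the reliability of the average of k raters. Their link to the single-rater values can be sketched with the Spearman-Brown formula, ICC_k = k * ICC / (1 + (k - 1) * ICC). The function name below is made up for illustration, and the sketch assumes the Shrout and Fleiss definitions, for which this relationship holds.

```python
def average_measures_icc(single_icc: float, k: int) -> float:
    """Step a single-rater ICC up to the reliability of the mean of k raters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Spearman-Brown: averaging more raters raises reliability toward 1
    return k * single_icc / (1 + (k - 1) * single_icc)


# Example: a single-rater ICC of 0.50 with 3 raters averaged
print(average_measures_icc(0.50, 3))  # 0.75
```

Because averaging cancels out some rater noise, the k variant is never lower than its single-rater counterpart when the single-rater ICC is between 0 and 1.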


In summary, the Intraclass Correlation Coefficient (ICC) is a valuable tool for assessing the consistency and reliability of judgments among raters. In this article, we learned step by step how to calculate it in Python with the Pingouin package: we created our data, called the intraclass_corr() function, and read off the ICC3 value to gauge the level of consistency among the raters.

Keep in mind that the ICC value should be evaluated against the level of reliability required for your specific research question, and that other variations of ICC (such as ICC1k, ICC2k, and ICC3k) may be better suited to some use cases. The key takeaway is that ICC measures the consistency of judgments, so it tells you how far you can rely on the ratings themselves; it does not by itself establish that the ratings are valid.
