Adventures in Machine Learning

Understanding Normality Testing with the Shapiro-Wilk Test

Statistical analysis is vital in many fields and helps us understand the data we collect. One aspect of statistical analysis is normality testing, where we test if our data is normally distributed.

Shapiro-Wilk Test

Normality testing is an essential aspect of statistical analysis since many tests require the sample data to follow a normal distribution. One way to check for normality is through the Shapiro-Wilk Test.

The Shapiro-Wilk Test is a hypothesis test whose null hypothesis is that a sample comes from a normal distribution. If the p-value is less than the chosen significance level, we reject the null hypothesis and conclude that the sample is unlikely to have come from a normal distribution.

Purpose and Applications of the Shapiro-Wilk Test

The primary purpose of the Shapiro-Wilk Test is to test for normality. Normality tests are important in many statistical techniques like regression analysis, ANOVA, and t-tests.

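These techniques assume that the data (or, in regression, the residuals) are approximately normally distributed, and their results can be unreliable when that assumption is badly violated. If our sample data is clearly not normally distributed, we may need to transform the data or choose a method that does not rely on normality.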

This is where the Shapiro-Wilk Test comes in: it checks whether the sample data is consistent with a normal distribution, so that we can decide whether these statistical methods are appropriate. The test is also sensitive to outliers. Outliers are values that lie far from the rest of the data points, and they can affect the analysis of the data.

The Shapiro-Wilk Test does not tell us which values are outliers, but a rejection on data that otherwise looks bell-shaped is a good reason to inspect the dataset for them.
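As a rough illustration of how outliers influence the result, the sketch below runs the test on a normal sample before and after appending a few extreme values. The sample size, the seed, the use of `np.random.default_rng`, and the three outlier values are all assumptions chosen for the demonstration, not part of the examples in this article.

```python
import numpy as np
from scipy.stats import shapiro

# Assumption: 200 standard-normal values with a fixed seed (illustrative only).
rng = np.random.default_rng(0)
clean = rng.standard_normal(200)

# Append three artificial extreme values to act as outliers.
with_outliers = np.concatenate([clean, [8.0, 9.0, 10.0]])

stat_clean, p_clean = shapiro(clean)
stat_out, p_out = shapiro(with_outliers)

print(f"Clean sample:  W={stat_clean:.4f}, p={p_clean:.4g}")
print(f"With outliers: W={stat_out:.4f}, p={p_out:.4g}")
```

The test reports only that the sample as a whole departs from normality. Finding the offending values still requires looking at the data itself, for example by sorting it or plotting it.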

Example 1: Standard Normal Distribution

Let us consider a sample data set with 1000 random values from a standard normal distribution.

To generate the dataset, we will use the numpy.random package. We will use the seed function to make the dataset reproducible.


import numpy as np
np.random.seed(25)
dataset = np.random.randn(1000)

The dataset variable holds the 1000 random values generated from a standard normal distribution. To test for normality, we will use the scipy.stats.shapiro() function.


from scipy.stats import shapiro
stat, p = shapiro(dataset)
print(f"Statistic: {stat}, p-value: {p}")

Output:

Statistic: 0.9998425841331482, p-value: 0.9856681227684021

The output shows that the statistic is 0.9998, which tells us how much the data deviates from the normal distribution. The closer the statistic is to 1, the closer the data is to normal.

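The p-value is 0.9857, which is greater than the significance level of 0.05. Therefore, we fail to reject the null hypothesis: the test finds no evidence that the dataset departs from a normal distribution. Note that failing to reject the null hypothesis is not the same as proving that the data is normal.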

Example 2: Poisson Distribution

Let us consider a sample dataset with 1000 values generated from a Poisson distribution with mean equal to 5. To generate the dataset, we will use numpy.random package.

We will use the seed function to make the dataset reproducible.


import numpy as np
np.random.seed(25)
dataset = np.random.poisson(5, 1000)

To test for normality, we will use the scipy.stats.shapiro() function.


from scipy.stats import shapiro
stat, p = shapiro(dataset)
print(f"Statistic: {stat}, p-value: {p}")

Output:

Statistic: 0.974152147769928, p-value: 1.5604319825543653e-16

The output shows that the statistic is 0.9742, and the p-value is less than the significance level of 0.05.

Therefore, we reject the null hypothesis and conclude that the dataset does not follow a normal distribution.

Test Output and Interpretation

Test Statistic and p-value:

The test statistic shows how closely the data follows a normal distribution.

The test statistic, usually written W, ranges from 0 to 1. The closer it is to 1, the more closely the sample matches a normal distribution.
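For reference, the statistic is commonly defined as follows. Here x_(i) denotes the i-th smallest value in the sample, x-bar is the sample mean, and the a_i are constants derived from the expected values and covariances of the order statistics of a standard normal sample. The notation is the standard textbook one and is not taken from this article.

```latex
W = \frac{\left( \sum_{i=1}^{n} a_i \, x_{(i)} \right)^{2}}
         {\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^{2}}
```

The numerator measures how well the ordered sample lines up with what a normal sample would look like, and the denominator is the usual sum of squared deviations from the mean, so W is close to 1 when the two agree.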

The p-value indicates the evidence against the null hypothesis. A small p-value means strong evidence against the null hypothesis and indicates that the data does not follow a normal distribution.

A large p-value means that the evidence against the null hypothesis is not that strong, and we cannot reject the null hypothesis.

Null Hypothesis and Conclusion:

The null hypothesis is the assumption that the sample data comes from a normal distribution.

If the p-value is greater than the significance level, we fail to reject the null hypothesis: the data is consistent with a normal distribution, although this does not prove that it is normal. On the other hand, if the p-value is less than the significance level, we reject the null hypothesis and conclude that the sample data does not follow a normal distribution.
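The decision rule can be sketched as a small helper. The function name `interpret_shapiro`, its `alpha` parameter, and the sample values are hypothetical and serve only to illustrate the rule; only `scipy.stats.shapiro()` comes from the article.

```python
from scipy.stats import shapiro

def interpret_shapiro(sample, alpha=0.05):
    """Hypothetical helper: run the Shapiro-Wilk test and describe the outcome."""
    stat, p = shapiro(sample)
    if p < alpha:
        verdict = "reject H0: the sample does not look normally distributed"
    else:
        verdict = "fail to reject H0: no evidence against normality"
    return stat, p, verdict

# Made-up, strongly right-skewed values used only to exercise the helper.
skewed = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 1.0,
          1.5, 2.5, 4.0, 7.0, 12.0, 20.0, 35.0, 60.0, 100.0, 170.0]

stat, p, verdict = interpret_shapiro(skewed)
print(f"W={stat:.4f}, p={p:.4g} -> {verdict}")
```

The wording of the second branch is deliberate: a large p-value means the test found no evidence against normality, not that normality has been confirmed.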

Conclusion

Normality testing is an essential aspect of statistical analysis. The Shapiro-Wilk Test is one of the ways in which we can test for normality. It helps us decide whether statistical methods that assume normally distributed data are appropriate, and a rejection can prompt us to look for outliers or skew in the data.

We have seen how we can test sample data using the scipy.stats.shapiro() function and interpret the output by examining the test statistic and p-value. We can also create sample datasets with different distributions to see how the Shapiro-Wilk Test behaves.
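One way to explore this is to run the test over several simulated samples in a loop. The following is a minimal sketch: the choice of distributions, their parameters, the sample size of 500, and the seed are assumptions made for illustration, and it uses `np.random.default_rng` rather than the `np.random.seed` call used in the examples.

```python
import numpy as np
from scipy.stats import shapiro

# Assumption: seed and sample size are arbitrary illustrative choices.
rng = np.random.default_rng(42)
n = 500

samples = {
    "normal": rng.normal(loc=10.0, scale=2.0, size=n),
    "uniform": rng.uniform(0.0, 1.0, size=n),
    "exponential": rng.exponential(scale=1.0, size=n),
}

results = {}
for name, data in samples.items():
    stat, p = shapiro(data)
    results[name] = (stat, p)
    print(f"{name:12s} W={stat:.4f}  p={p:.4g}")
```

With samples of this size, the uniform and exponential data should be rejected at the 0.05 level, while the normal sample will usually, though not always, pass.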

In the previous section, we discussed the Shapiro-Wilk Test and its purpose and applications in testing for normality. As a review, the Shapiro-Wilk Test is a hypothesis test that tests the null hypothesis that a sample comes from a normal distribution. In this section, we will dive deeper into two specific examples, generating sample data from a standard normal distribution and a Poisson distribution, applying the Shapiro-Wilk Test to the generated data, and interpreting the test results.

Example 1: Standard Normal Distribution

In a standard normal distribution, the mean is 0, and the standard deviation is 1.

To generate a sample dataset from a standard normal distribution, we can use the numpy.random package. We will use the seed function to make the dataset reproducible.


import numpy as np
np.random.seed(25)
dataset = np.random.randn(1000)

The dataset variable holds the 1000 random values generated from a standard normal distribution. To test for normality, we will use the scipy.stats.shapiro() function.


from scipy.stats import shapiro
stat, p = shapiro(dataset)
print(f"Statistic: {stat}, p-value: {p}")

The output shows that the statistic is 0.9998, and the p-value is 0.9857. Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis: the dataset is consistent with a normal distribution.

Let us now examine the Shapiro-Wilk Test in detail to understand how it works:

Data Generation

We start by generating a sample dataset from a standard normal distribution using the numpy.random package. We use the seed function to ensure reproducibility of results.


import numpy as np
np.random.seed(25)
dataset = np.random.randn(1000)

The numpy.random.randn() function generates random values from a standard normal distribution with a mean of 0 and a standard deviation of 1.
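A quick sanity check, sketched below, is to confirm that the sample mean and standard deviation of the generated data are close to 0 and 1. The variable names `sample_mean` and `sample_std` are introduced here for illustration.

```python
import numpy as np

# Same seed and sample size as in the example.
np.random.seed(25)
dataset = np.random.randn(1000)

sample_mean = dataset.mean()
sample_std = dataset.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f"mean={sample_mean:.4f}, std={sample_std:.4f}")
```

The values will not be exactly 0 and 1 because the sample is finite, but with 1000 draws they should be close.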

Shapiro-Wilk Test

After generating the sample data from a standard normal distribution, we apply the Shapiro-Wilk Test to test for normality. We use the scipy.stats.shapiro() function, which returns the test statistic and p-value.


from scipy.stats import shapiro
stat, p = shapiro(dataset)
print(f"Statistic: {stat}, p-value: {p}")

The output shows the test statistic and the p-value. The test statistic, in this case, is 0.9998, which indicates that the data is close to a normal distribution.

The p-value is 0.9857, which is greater than the significance level of 0.05, so the test gives no evidence that the data departs from a normal distribution.
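Besides tuple unpacking, the returned value can also be read by attribute name. The sketch below assumes a reasonably recent SciPy release, in which `shapiro()` returns a result object with `statistic` and `pvalue` fields; the tuple unpacking used in the examples works in either case.

```python
import numpy as np
from scipy.stats import shapiro

np.random.seed(25)
dataset = np.random.randn(1000)

result = shapiro(dataset)

# Attribute access (assumes a SciPy version that returns a named result).
print(f"Statistic: {result.statistic}, p-value: {result.pvalue}")

# Tuple unpacking, as used in the examples, gives the same two numbers.
stat, p = result
```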

Null Hypothesis and Conclusion

The null hypothesis in the Shapiro-Wilk Test is that the sample data comes from a normal distribution. If the p-value is less than the significance level, we reject the null hypothesis and conclude that the sample data does not follow a normal distribution.

In this example, the p-value is greater than the significance level; hence we fail to reject the null hypothesis. The sample data is consistent with a normal distribution, which is what we expect given how it was generated.

Example 2: Poisson Distribution

A Poisson distribution is a discrete probability distribution that represents the number of times an event occurs in a specified interval.

The average number of events that occur in a given interval is represented by the parameter lambda. We can use numpy.random package to generate sample data from a Poisson distribution.

We will use the seed function to make the dataset reproducible.


import numpy as np
np.random.seed(25)
dataset = np.random.poisson(5, 1000)

The dataset variable holds the 1000 values generated from a Poisson distribution with a mean of 5. To test for normality, we will use the scipy.stats.shapiro() function.


from scipy.stats import shapiro
stat, p = shapiro(dataset)
print(f"Statistic: {stat}, p-value: {p}")

The output shows that the test statistic is 0.9742, and the p-value is less than the significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the dataset does not follow a normal distribution.

Let us now examine the Shapiro-Wilk Test in detail to understand how it works:

Data Generation

We start by generating a sample dataset from a Poisson distribution with a mean of 5. We use the numpy.random.poisson() function to generate 1000 values.


import numpy as np
np.random.seed(25)
dataset = np.random.poisson(5, 1000)

The numpy.random.poisson() function generates random values from a Poisson distribution with a specified mean.
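To see why this data is unlikely to pass a normality test, it can help to look at a few summary figures first. The sketch below is an illustrative addition: it checks that the sample mean and variance are both near 5 (a property of the Poisson distribution), that the values are integers with few distinct levels, and that the sample is right-skewed. The use of `scipy.stats.skew` and the variable names are assumptions made for the illustration.

```python
import numpy as np
from scipy.stats import skew

# Same seed and parameters as in the example.
np.random.seed(25)
dataset = np.random.poisson(5, 1000)

sample_mean = dataset.mean()
sample_var = dataset.var(ddof=1)       # sample variance
sample_skew = skew(dataset)            # positive means a longer right tail
n_distinct = np.unique(dataset).size   # number of distinct integer values

print(f"mean={sample_mean:.3f}, variance={sample_var:.3f}")
print(f"skewness={sample_skew:.3f}, distinct values={n_distinct}")
```

Discreteness and skew are both departures from normality, and with 1000 observations the Shapiro-Wilk Test has enough power to detect them.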

Shapiro-Wilk Test

After generating the sample data from a Poisson distribution, we apply the Shapiro-Wilk Test to test for normality. We use the scipy.stats.shapiro() function, which returns the test statistic and p-value.


from scipy.stats import shapiro
stat, p = shapiro(dataset)
print(f"Statistic: {stat}, p-value: {p}")

The output shows the test statistic and the p-value. The test statistic, in this case, is 0.9742, which is further from 1 than in the first example and points to a departure from a normal distribution.

The p-value is less than the significance level of 0.05, indicating strong evidence against the null hypothesis that the sample data follows a normal distribution.

Null Hypothesis and Conclusion

The null hypothesis in the Shapiro-Wilk Test is that the sample data comes from a normal distribution. If the p-value is less than the significance level, we reject the null hypothesis and conclude that the sample data does not follow a normal distribution.

In this example, the p-value is less than the significance level; hence we reject the null hypothesis and conclude that the sample data does not follow a normal distribution.

In conclusion, normality testing is a crucial step in statistical analysis, and the Shapiro-Wilk Test is one way of testing for normality. We have shown how the Shapiro-Wilk Test can be applied to sample datasets generated from a standard normal distribution and a Poisson distribution, and how we can interpret the test results. Understanding how to use the Shapiro-Wilk Test is important to ensure that we use statistical methods that are appropriate for the type of data we have.

In summary, the Shapiro-Wilk Test is a hypothesis test whose null hypothesis is that a sample comes from a normal distribution. It helps us check for normality and decide whether statistical methods that assume normally distributed data are appropriate; because it is sensitive to outliers, a rejection is also a cue to inspect the data for them.

Through two examples, we showed how to generate sample data from a standard normal distribution and a Poisson distribution, apply the Shapiro-Wilk Test, and interpret the test results. Understanding normality testing helps ensure the accuracy of statistical analysis, and the Shapiro-Wilk Test is a valuable tool for this purpose.
