Adventures in Machine Learning

Unmasking Data Variability: Exploring Median Absolute Deviation (MAD) in Python

In statistical analysis, dispersion is a measure of how much the data points in a set vary from their average value. Outliers, which are data points that significantly differ from other observations, can skew the dispersion measurement.

One way to account for and minimize the effect of outliers in a dataset is through the use of median absolute deviation (MAD). In this article, we will define MAD, discuss its usefulness, and explore its formula and how to calculate it using Python.

We will also provide examples of MAD calculations for NumPy arrays and pandas data frames. We will furthermore highlight a scaling factor problem in MAD calculation and a solution to avoid this issue using Python.

Definition and Usefulness of MAD

MAD is a robust measure of dispersion that computes how far away the individual data points are from their median. It is a reliable alternative to standard deviation when a statistical sample has outliers.

By using MAD, you can obtain a stable indication of the variability of the values in a given dataset while minimizing the influence of extreme values. MAD is also useful as a robust estimate of scale in robust regression analysis, which estimates the relationship between variables while limiting the influence of outliers.
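To make the robustness claim concrete, here is a minimal sketch that compares the standard deviation and the MAD on a small sample with and without one extreme value. The helper name raw_mad and the sample values are illustrative assumptions, not part of any library; the helper simply takes the median of the absolute deviations from the median.

```python
import numpy as np


def raw_mad(values):
    """Unscaled median absolute deviation of a 1-D sequence (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))


# Two made-up samples that differ only in the last value
clean = np.array([10, 11, 12, 13, 14], dtype=float)
with_outlier = np.array([10, 11, 12, 13, 1000], dtype=float)

std_clean, std_outlier = clean.std(), with_outlier.std()
mad_clean, mad_outlier = raw_mad(clean), raw_mad(with_outlier)

print("std:", std_clean, std_outlier)  # the outlier inflates the standard deviation
print("MAD:", mad_clean, mad_outlier)  # the MAD is 1.0 in both cases
```

In this toy sample, the single extreme value changes the standard deviation by two orders of magnitude, while the MAD does not move at all.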

Formula for MAD and its Computation using Python

The formula for MAD is quite simple: First, calculate the median of the data set. Then, calculate the absolute deviation of each data point from the median.

Finally, obtain the median of the absolute deviations. In symbols, MAD = median(|x_i - median(x)|). To calculate MAD using Python, you can use the mad() function from the statsmodels package's robust module (available as robust.mad or robust.scale.mad). Note that, by default, this function divides the raw MAD by approximately 0.6745, which is equivalent to multiplying it by approximately 1.4826, so that the result estimates the standard deviation of normally distributed data.

The syntax to execute the function on a given series x is:

```python
from statsmodels import robust

mad = robust.mad(x)
```
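The three steps described above can also be written out directly with NumPy, which makes each intermediate result visible. This is only a sketch on made-up data, and it returns the raw (unscaled) MAD rather than the scaled value that statsmodels produces by default.

```python
import numpy as np

# Made-up sample with one extreme value
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Step 1: median of the data set
center = np.median(data)

# Step 2: absolute deviation of each data point from the median
abs_dev = np.abs(data - center)

# Step 3: median of the absolute deviations
mad = np.median(abs_dev)

print(center)   # 3.0
print(abs_dev)  # [ 2.  1.  0.  1. 97.]
print(mad)      # 1.0
```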

Example 1: Calculation of MAD for a NumPy array

To illustrate MAD calculation for a NumPy array, suppose we have an array with the following values: [5.7, 4, 2, 8.6, 7, 5.4, 3, 6, 8.1]. To calculate MAD, we can use the following code:

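```python
import numpy as np
from statsmodels import robust

x = np.array([5.7, 4, 2, 8.6, 7, 5.4, 3, 6, 8.1])

# robust.mad applies the normal-consistency scaling by default
mad = robust.mad(x)
print(mad)
```

The median of x is 5.7, and the median of the absolute deviations from it is 1.7. Because robust.mad() scales this value by approximately 1.4826 by default, the resulting value of MAD is approximately 2.52.

Example 2: Calculation of MAD for a pandas DataFrame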

Pandas data frames contain columns of different types with distinct names.

To obtain the MAD of a column, we can pass the column to the robust.mad() function or compute it directly with pandas methods. Suppose we have a data frame with two columns, “x” and “y,” that contain the following values:

| x | y |
|---|---|
| 10 | 20 |
| 20 | 40 |
| 30 | 60 |
| 40 | 80 |

To calculate the MAD of the “y” column, we can execute the following code:

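```python
import pandas as pd
from statsmodels import robust

df = pd.DataFrame({'x': [10, 20, 30, 40], 'y': [20, 40, 60, 80]})

# Unscaled median absolute deviation, computed with pandas alone
mad_y = (df['y'] - df['y'].median()).abs().median()

# The same quantity from statsmodels; c=1 turns off the default scaling
mad_y_alt = robust.mad(df['y'], c=1)
```

The resulting value of mad_y is 20.0, and mad_y_alt is 20.0. Both approaches yield the same result. Note that the older df['y'].mad() method in pandas computed the mean absolute deviation, not the median absolute deviation, and it was removed in pandas 2.0.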
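Because the abbreviation "MAD" is used for both the median and the mean absolute deviation, it can help to compute the two side by side. The following sketch uses a made-up Series (the values and variable names are assumptions for illustration) and relies only on standard pandas methods.

```python
import pandas as pd

# Made-up values with one extreme observation
s = pd.Series([1, 2, 3, 4, 100], name="values")

# Median absolute deviation: median of |value - median|
median_abs_dev = (s - s.median()).abs().median()

# Mean absolute deviation: mean of |value - mean|
mean_abs_dev = (s - s.mean()).abs().mean()

print(median_abs_dev)  # 1.0
print(mean_abs_dev)    # about 31.2, pulled up by the extreme value
```

On the symmetric “y” column from the example above, the two quantities happen to coincide at 20, but on skewed data such as this Series they can differ widely.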

Scaling Factor in MAD Calculation

A scaling factor is a constant by which the raw MAD is multiplied so that it can be compared with another measure of spread. The usual scaling factor for MAD is 1.4826, which makes the scaled MAD a consistent estimate of the standard deviation when the data follow a normal distribution.

However, this scaling factor is tied to the normality assumption. For non-normally distributed datasets, such as skewed data, which are common in real-life scenarios, the scaled MAD no longer estimates the standard deviation and can be misleading if it is read that way. To avoid these issues, we can specify the scaling factor to be 1 instead of 1.4826.

By doing so, we obtain the median of the absolute deviations in the original form, which is non-scaled. The following Python code illustrates how this can be done:

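```python
mad = robust.scale.mad(x, c=1, center=np.median)
```

This code computes the MAD of a data set x without using the default scaling factor.

In statsmodels, the normalization constant is passed as the c argument, and the median of the absolute deviations is divided by it. The default value of c is approximately 0.6745, which is equivalent to multiplying by 1.4826. Setting c to 1 provides the non-scaled MAD value.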
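For readers wondering where 1.4826 comes from, the sketch below derives it as the reciprocal of the 75th percentile of the standard normal distribution and then checks, on simulated normal data, that the scaled MAD lands close to the standard deviation. The seed, sample size, and variable names are arbitrary choices for illustration.

```python
from statistics import NormalDist

import numpy as np

# 75th percentile of the standard normal distribution (about 0.6745)
q75 = NormalDist().inv_cdf(0.75)

# Its reciprocal is the usual MAD scaling factor (about 1.4826)
scale_factor = 1.0 / q75

# Simulated normal data with a true standard deviation of 2.0
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=2.0, size=100_000)

raw_mad = np.median(np.abs(sample - np.median(sample)))
scaled_mad = scale_factor * raw_mad
sample_std = sample.std()

print(q75, scale_factor)
print(scaled_mad, sample_std)  # both should be close to 2.0
```

The agreement between the two printed estimates depends on the data being normal; on skewed data the scaled MAD and the standard deviation can diverge.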

Conclusion

MAD is a useful tool for determining the dispersion of data points that may contain outliers. By using the median instead of the mean, we can minimize the influence of extreme values when computing dispersion measures.

In Python, the statsmodels package provides a mad() function in its robust module that can easily compute MAD for a given dataset. Additionally, setting the scaling constant to 1 returns the raw MAD, which avoids relying on the normality assumption for non-normal datasets.

As a robust measure of dispersion, MAD is valuable in many fields, including finance, economics, and engineering.

MAD Calculation for Multiple Columns in a DataFrame

In statistical analysis, it is common to have a dataset with multiple variables, each of which may have different dispersion patterns. Therefore, calculating the MAD for each column in a pandas DataFrame can provide valuable insights into the variability of each variable independently.

In this article, we will provide an example of how to calculate the MAD for all columns in a pandas DataFrame and discuss how this can help us to identify and analyze patterns of variability.

Example of Calculating MAD for All Columns in a Pandas DataFrame

Suppose we have a pandas data frame with the following columns: “x”, “y”, and “z,” and the following values:

| x | y | z |
|---|---|---|
| 10 | 20 | 30 |
| 20 | 40 | 60 |
| 30 | 60 | 90 |
| 40 | 80 | 120 |

We want to calculate the MAD for each column. Here’s how we can do it using Python:

```python
import numpy as np
import pandas as pd
from statsmodels import robust

df = pd.DataFrame({'x': [10, 20, 30, 40], 'y': [20, 40, 60, 80], 'z': [30, 60, 90, 120]})

# Apply the unscaled MAD (c=1) to each column in turn
mad_df = df.apply(lambda col: robust.scale.mad(col, c=1, center=np.median))
```

Here we use the apply() method to apply a lambda function to each column in the data frame. The lambda function calculates the MAD for each column using the robust.scale.mad() function from the statsmodels package, with the center argument set to np.median and the scaling constant c set to 1.

The resulting MAD values for each column are then stored in mad_df, a pandas Series indexed by column name. The resulting MAD values are:

| x | y | z |
|---|---|---|
| 10.0 | 20.0 | 30.0 |

The MAD values show that the variability of the “z” column is one and a half times that of the “y” column and three times that of the “x” column.

Analyzing the Results

By calculating the MAD for each column, we can observe patterns of variability and understand how they compare to each other. For example, we can use the MAD to compare the variability of each column to its mean or median value.

A column with a high MAD relative to its mean or median suggests that its values are widely spread compared with their typical size, which may indicate that it is more strongly affected by external factors. Furthermore, the MAD can be used to compare the variability of different variables in a dataset.
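The comparison of each column's MAD against its median can be sketched with pandas alone. The code below reuses the three-column example data; the name relative_mad is an illustrative choice, and dividing by the median only makes sense when the medians are nonzero.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40],
                   'y': [20, 40, 60, 80],
                   'z': [30, 60, 90, 120]})

# Column medians and unscaled column MADs
medians = df.median()
mads = (df - medians).abs().median()

# MAD expressed as a fraction of each column's median
relative_mad = mads / medians

print(mads)          # x: 10.0, y: 20.0, z: 30.0
print(relative_mad)  # 0.4 for every column
```

Here the raw MAD values differ, but the relative values are identical because each column is a multiple of the first, so the columns vary by the same proportion of their typical size.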

For instance, in finance, the variability between stocks can be analyzed by comparing their MAD values. Moreover, the MAD values can be used to identify potential outliers.

An observation that deviates substantially from the median of a variable can be identified as an outlier. The MAD can tell us about the variability of the values relative to the median of each variable, and hence, we can use the MAD to identify observations that deviate substantially and may need further analysis.
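One common convention for turning this idea into a rule is the modified z-score, which divides each deviation from the median by the MAD, rescales by 0.6745, and flags absolute scores above a cutoff such as 3.5. The sketch below is one possible implementation under those assumptions; the function name mad_outliers, the default threshold, and the sample data are illustrative and not taken from the article.

```python
import numpy as np


def mad_outliers(values, threshold=3.5):
    """Boolean mask of points whose modified z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # At least half the values equal the median; the score is undefined,
        # so this sketch flags nothing rather than dividing by zero.
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold


# Made-up observations with one value far from the rest
data = [10, 12, 11, 13, 12, 11, 95]
mask = mad_outliers(data)

print(mask)  # only the last value (95) is flagged
```

The cutoff is a judgment call, and flagged observations are candidates for further analysis rather than values that must be removed.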

Conclusion

In conclusion, calculating the MAD for each column in a pandas DataFrame can provide valuable insights into the variability of each variable independently. By comparing the MAD values of different variables, we can see which variables vary the most and flag potential outliers.

Furthermore, the MAD value can be used to determine the sensitivity of a variable to external factors. The example presented in this article illustrates how easy it is to calculate MAD for all columns in a pandas data frame and how this can be used to analyze the variability of different variables.

In summary, calculating the median absolute deviation (MAD) can be a valuable tool for analyzing dispersion in a dataset. By using the median instead of the mean, we can reduce the influence of outliers and obtain an accurate indication of variability.

The statsmodels package in Python provides a simple method to compute MAD for a given dataset, and the scaling factor in the MAD computation can be set to 1 when the data are not normally distributed. Additionally, calculating MAD for all columns in a pandas DataFrame can help analyze patterns of variability in each variable, providing valuable insights such as identifying potential outliers and determining the sensitivity of variables to external factors.

Overall, studying the median absolute deviation can lead to better data analysis and understanding in various fields, such as finance, engineering, and economics.
