Adventures in Machine Learning

Mastering Data Scaling in Python: Techniques and Best Practices

Data Scaling in Python

Have you ever wondered why data scaling is an essential preprocessing step before applying many machine learning algorithms? Machine learning algorithms depend on mathematical transformations and calculations, which can be adversely affected by differences in the scale of the data.

In this article, we will explore the importance of data scaling, the techniques used for data scaling, and their implementation using the Cruise Ship dataset.

Importance of Data Scaling

Many machine learning algorithms are sensitive to the scale of the variables present in the dataset. For instance, linear regression computes a weighted sum of the input variables, so the magnitude of each variable directly affects the size of its coefficient and, when the model is regularized or fitted by gradient descent, the fitted line itself.

Similarly, algorithms like k-NN (k-Nearest Neighbors) and K-Means clustering, where the distance calculation between the data points is crucial, will be significantly impacted by the variations in the data scales. To overcome these issues and attain better model performance, the data needs to be preprocessed, and the variables must be scaled.

Scaling refers to transforming numerical variables into a more compatible scale. For example, we can consider two input variables age and weight, where age lies in the range of 0-120, while weight ranges from 0-300 lbs.

Here, since both variables have different scales, they must be standardized or normalized to an equal scale that best fits the algorithm.
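The effect on distance-based algorithms can be illustrated with a minimal sketch. The three (age, weight) points below are hypothetical values chosen only for illustration, and dividing by the ranges quoted above (0-120 and 0-300) is one simple way to put both variables on a comparable scale.

```python
import numpy as np

# Hypothetical (age in years, weight in lbs) points, chosen only for illustration.
a = np.array([25.0, 150.0])
b = np.array([55.0, 152.0])   # very different age, similar weight
c = np.array([26.0, 185.0])   # similar age, different weight

# Euclidean distances on the raw values.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Divide each variable by the range quoted in the text before measuring distance.
ranges = np.array([120.0, 300.0])
scaled_ab = np.linalg.norm((a - b) / ranges)
scaled_ac = np.linalg.norm((a - c) / ranges)

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

On the raw values, `b` is the nearer neighbor of `a`; after rescaling, `c` is. A k-NN model would therefore return different neighbors depending on whether the variables were scaled.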

Techniques for Data Scaling

There are two standard techniques used for data scaling:

  • Standardization
  • Normalization

Let us explore each of them in detail.

Standardization

Standardization is the process of transforming each input variable so that it has a mean of 0 and a standard deviation of 1, the same mean and standard deviation as the Standard Normal Distribution. It shifts and rescales the values but does not change the shape of their distribution. The formula to calculate the standardized value for any input variable is:

z = (x – μ) / σ

Here,

  • x: Input Variable
  • μ: Mean of the inputs
  • σ: Standard Deviation
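The formula can be applied directly with NumPy. This is a minimal sketch on hypothetical sample values; it uses the population standard deviation (`ddof=0`), which is an assumption made here to match the default behavior of `np.std`.

```python
import numpy as np

# Hypothetical sample values, used only to illustrate the formula.
x = np.array([4.0, 10.0, 14.0, 20.0, 48.0])

mu = x.mean()      # μ: mean of the inputs
sigma = x.std()    # σ: population standard deviation (ddof=0)
z = (x - mu) / sigma

print(z)
print(z.mean(), z.std())   # approximately 0 and 1
```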

Normalization

Normalization is the process of transforming the input variables to lie within a specific range, usually 0-1. Here, we subtract the variable's minimum value from each input value and divide the result by the variable's range (maximum minus minimum).

The formula to calculate the normalized value for any input variable is:

x’ = (x – min) / (max-min)

Here,

  • x: Input Variable
  • min: Minimum value in the dataset
  • max: Maximum value in the dataset
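As with standardization, the formula can be sketched in a few lines of NumPy. The sample values are hypothetical. Note that a constant variable (max equal to min) would cause a division by zero, which this sketch does not handle.

```python
import numpy as np

# Hypothetical sample values, used only to illustrate the formula.
x = np.array([4.0, 10.0, 14.0, 20.0, 48.0])

x_min, x_max = x.min(), x.max()
x_scaled = (x - x_min) / (x_max - x_min)

print(x_scaled)   # smallest value maps to 0, largest to 1
```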

Using Cruise Ship dataset for Scaling Techniques

To better understand the techniques used in scaling, let us consider the Cruise Ship dataset. This dataset contains information about various cruise ships, such as their age, tonnage, number of passengers, length, number of cabins, passenger density, and number of crew members.

We can apply the Standardization and Normalization techniques of scaling to transform the variables.

Basic Stats of the Data

Before implementing the scaling techniques, let us take a look at some basic statistical characteristics of the dataset. We can use Python’s pandas library to load the dataset and get a quick summary:

import pandas as pd
# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')
# Shape of the data
print("Data Shape:", df.shape)
# Columns in the data
print("Data Columns:", df.columns)
# Summary Statistics of the data
print("Summary Statistics:\n", df.describe())

Output:

Data Shape: (158, 9)
Data Columns: Index(['Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers', 'length',
       'cabins', 'passenger_density', 'crew'],
      dtype='object')
Summary Statistics:
               Age     Tonnage  passengers      length      cabins  
count  158.000000  158.000000  158.000000  158.000000  158.000000   
mean    15.689873   71.284671   18.457405    8.130633    8.830000   
std      7.615691   37.229540    9.677095    1.793474    4.471417   
min      4.000000    2.329000    0.660000    2.790000    0.330000   
25%     10.000000   46.013000   12.535000    7.100000    6.132500   
50%     14.000000   71.899000   19.500000    8.555000    9.570000   
75%     20.000000   90.772500   24.845000    9.510000   10.885000   
max     48.000000  220.000000   54.000000   11.820000   27.000000   
       passenger_density        crew  
count         158.000000  158.000000  
mean           39.900949    7.794177  
std             8.639217    3.503487  
min            17.700000    0.590000  
25%            34.570000    5.480000  
50%            39.085000    8.150000  
75%            44.185000    9.990000  
max            71.430000   21.000000  

From the output, we can see that the dataset contains 158 rows and 9 columns. The columns represent the cruise ship name, line, age, tonnage, passengers, length, cabins, passenger density, and crew.

Additionally, we can see that the statistical measures like mean, standard deviation, minimum value, and maximum value are also computed on each of the numerical columns. Now that we have loaded the dataset and generated basic statistics, let’s proceed with the scaling techniques.

Standardization in Python

To implement Standardization in Python, we can use the StandardScaler class from the sklearn library. The StandardScaler class applies the formula `(x - u) / s` to each input variable, where `u` is the mean of that variable and `s` is its standard deviation.

The scaled variables will then have a mean value of 0 and a standard deviation of 1.

import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')
# Select the numerical columns
cols_to_scale = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']
# Scale the variables using StandardScaler
scaler = StandardScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
# Display the scaled data
print(df.head(10))

Output:

   Ship_name Cruise_line       Age   Tonnage  passengers    length    cabins  
0    Journey     Azamara -0.822261 -1.104984   -1.194924 -1.225046 -1.185325   
1      Quest     Azamara -1.554188 -0.647310   -0.771357 -0.509447 -0.693891   
2  Celebration    Carnival  0.649499 -0.558930   -0.487407 -0.099617 -0.364679   
3     Conquest    Carnival -0.186633  1.108718    1.326205  1.049661  1.231401   
4      Destiny    Carnival  1.337615  0.137429    0.848723  0.654077  0.676370   
5      Ecstasy    Carnival -1.005436 -0.817070   -0.702163 -0.562634 -0.891585   
6      Elation    Carnival -0.358123 -0.467793   -0.333580  0.336224 -0.414658   
7      Fantasy    Carnival  0.189602 -0.790778   -0.787320 -0.509494 -0.732734   
8  Fascination    Carnival  0.949375 -0.567581   -0.502371 -0.126211 -0.287067   
9      Freedom    Carnival -0.186633  0.079048    0.260232  0.307157  0.355156   
   passenger_density   crew  
0          -0.662084  0.960  
1          -0.465121  0.840  
2          -0.011950  0.540  
3          -1.676913  6.160  
4           0.534998  3.550  
5          -0.678947  0.740  
6          -0.391652  0.950  
7          -0.464286  0.790  
8          -0.717636  1.150  
9           0.019544  0.950  

In the output, we can see that the selected columns have been standardized using the StandardScaler. As expected, each column has a mean value close to 0 and a standard deviation of 1.
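The mean and standard deviation can be checked directly rather than read off the first rows. The sketch below uses a small hypothetical frame in place of the cruise ship file. One detail worth knowing: StandardScaler divides by the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so the pandas default reports a value slightly above 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small hypothetical frame standing in for the numeric cruise ship columns.
df_demo = pd.DataFrame({
    "Age": [6, 6, 26, 11, 17],
    "Tonnage": [30.277, 30.277, 47.262, 110.0, 101.353],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df_demo), columns=df_demo.columns)

print(scaled.mean())        # close to 0 for each column
print(scaled.std(ddof=0))   # 1 for each column (population standard deviation)
print(scaled.std())         # pandas default ddof=1, slightly above 1
```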

Normalization in Python

To implement Normalization in Python, we can use the MinMaxScaler class from the sklearn library. The MinMaxScaler class applies the formula `(x - min) / (max - min)` to each input variable, where `min` and `max` are the minimum and maximum values of that variable, respectively.

The scaled variables will then have a value between 0 and 1.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')
# Select the numerical columns
cols_to_scale = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']
# Scale the variables using MinMaxScaler
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
# Display the scaled data
print(df.head(10))

Output:

   Ship_name Cruise_line       Age   Tonnage  passengers    length    cabins  
0    Journey     Azamara  0.148936  0.064452    0.035595  0.246959  0.037037   
1      Quest     Azamara  0.106383  0.079999    0.047090  0.320158  0.055556   
2  Celebration    Carnival  0.382979  0.081933    0.055479  0.486058  0.075926   
3     Conquest    Carnival  0.000000  0.318020    0.247706  0.840678  0.363426   
4      Destiny    Carnival  0.510638  0.155306    0.154206  0.676329  0.198413   
5      Ecstasy    Carnival  0.138298  0.072029    0.050797  0.289862  0.048148   
6      Elation    Carnival  0.063830  0.084021    0.064231  0.553114  0.115741   
7      Fantasy    Carnival  0.297872  0.076239    0.046237  0.320202  0.062963   
8  Fascination    Carnival  0.425532  0.080946    0.052918  0.472064  0.092593   
9      Freedom    Carnival  0.404255  0.153774    0.100331  0.545478  0.135802   
   passenger_density   crew  
0           0.142857  0.960  
1           0.169399  0.840  
2           0.393204  0.540  
3           0.007888  6.160  
4           0.619710  3.550  
5           0.139035  0.740  
6           0.241860  0.950  
7           0.168947  0.790  
8           0.093310  1.150  
9           0.306432  0.950  

In the output, we can see that the selected columns have been normalized using the MinMaxScaler. As expected, each column has a value between 0 and 1.
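A quick way to confirm the range, and to get back to the original units when needed, is sketched below on a small hypothetical frame. It assumes only the `fit_transform` and `inverse_transform` methods of MinMaxScaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small hypothetical frame standing in for two of the cruise ship columns.
df_demo = pd.DataFrame({
    "Age": [6, 26, 11, 17, 48],
    "length": [5.94, 7.22, 9.53, 9.52, 2.79],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df_demo)   # returns a NumPy array

# Each column should now span exactly [0, 1].
print(scaled.min(axis=0), scaled.max(axis=0))

# inverse_transform maps the scaled values back to the original units.
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, df_demo.to_numpy()))
```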

Conclusion

To conclude, scaling the variables is an essential step in the preprocessing of data for machine learning models. Variables with different scales can significantly impact the model performance.

In this article, we explored the two common techniques of scaling: Standardization and Normalization. We also applied these techniques using the Cruise Ship dataset, and the results showed that both techniques transformed the variables into a compatible scale.

By choosing the appropriate scaling technique that suits the machine learning algorithm, we can achieve better model accuracy and performance.

Python Data Scaling – Standardization and Normalization

In machine learning, data preparation is a crucial step in building an accurate and efficient model. One of the most important preprocessing steps is data scaling, which involves transforming the numerical variables into a compatible scale.

Scaling is necessary because many machine learning algorithms are sensitive to the scale of the variables. It is particularly important when the variables are measured on different scales and the algorithm depends on their magnitudes, as distance-based and gradient-based methods do.

In this article, we will discuss the two most common techniques used in data scaling: Standardization and Normalization.

Standardization

In data standardization, the variables are transformed to have zero mean and unit standard deviation. The goal of this transformation is to make the variables comparable and to help gradient-based algorithms converge faster.

The standardization of data is achieved by subtracting the mean of the variable from each observation and then dividing by the standard deviation of the variable. The resulting variable will have a mean of 0 and a standard deviation of 1.

This process is also known as z-score normalization.

Standardization in Python

Python provides several libraries for standardization, such as numpy, pandas, and sklearn. Scikit-learn is a popular library used in machine learning that provides various preprocessing functions to transform data.

Here are the steps for standardization:

Step 1: Importing Libraries

To standardize the data, we need to import the necessary libraries like numpy, pandas, and sklearn.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

Step 2: Load the Data

Next, we need to load the data into a pandas dataframe.

data = pd.read_csv('dataset.csv')

Step 3: Select the Columns to Scale

Select the columns that need to be standardized.

cols_to_scale = ['column1', 'column2', 'column3']

Step 4: Scale the Data

Instantiate a StandardScaler object, then call its fit_transform method, which learns the mean and standard deviation of each selected column and transforms those columns in one step.

scaler = StandardScaler()
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
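Fitting and transforming can also be done as two separate calls, which makes the learned statistics visible. The sketch below uses a small hypothetical array; `mean_` and `scale_` are the fitted attributes in which StandardScaler stores the per-column mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column array, used only for illustration.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 600.0]])

scaler = StandardScaler()
scaler.fit(X)                      # learn the statistics only
print(scaler.mean_)                # per-column means
print(scaler.scale_)               # per-column standard deviations

X_two_step = scaler.transform(X)   # apply the learned statistics
X_one_step = StandardScaler().fit_transform(X)
print(np.allclose(X_two_step, X_one_step))
```

Keeping the two steps separate is useful when the same fitted scaler has to be applied to data that arrives later.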

Step 5: Visualization

Finally, we can visualize the data using a KDE plot. The seaborn library provides a distribution plot function that can help us graph distributions and visualize the standardization effect.

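import seaborn as sns
# `data` was overwritten in Step 4, so reload the unscaled values for comparison
original = pd.read_csv('dataset.csv')
ax = sns.kdeplot(original['column1'], label='Original')
sns.kdeplot(data['column1'], label='Standardized', ax=ax)
ax.legend()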

Normalization

The normalization of data involves scaling the data in a way that the values are mapped to a specified range. Most commonly in normalization, the values are scaled to the range between 0 and 1.

This type of normalization is often called min-max scaling.

Normalization in Python

Normalization can be done in Python using the MinMaxScaler class of the sklearn library. The MinMaxScaler class scales the data column-wise, meaning that each variable is rescaled independently using its own minimum and maximum. Note that sklearn also provides a Normalizer class, but it does something different: it rescales each row to unit norm rather than mapping each column to a range.

In contrast to standardization, min-max normalization maps the values to a fixed, bounded range, and that range can be set according to the application through the feature_range parameter. Because it relies on the minimum and maximum, it is sensitive to outliers. Here are the steps for normalization:

Step 1: Importing Libraries

To normalize the data, we need to import the necessary libraries like numpy, pandas, and sklearn.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

Step 2: Load the Data

Next, we need to load the data into a pandas dataframe.

data = pd.read_csv('dataset.csv')

Step 3: Select the Columns to Scale

Select the columns that need to be normalized.

cols_to_scale = ['column1', 'column2', 'column3']

Step 4: Scale the Data

Instantiate a MinMaxScaler object, then call its fit_transform method, which learns the minimum and maximum of each selected column and transforms those columns in one step.

scaler = MinMaxScaler()
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
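Because the names are easy to confuse, it may help to see how MinMaxScaler and Normalizer differ on the same input. This is a minimal sketch on a hypothetical array: MinMaxScaler works column by column and maps each column to the range 0 to 1, while Normalizer works row by row and rescales each row to unit length (L2 norm by default).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical array, used only for illustration.
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

col_scaled = MinMaxScaler().fit_transform(X)   # per column: (x - min) / (max - min)
row_scaled = Normalizer().fit_transform(X)     # per row: x / ||x||

print(col_scaled)
print(row_scaled)
print(np.linalg.norm(row_scaled, axis=1))      # every row now has length 1
```

The first two rows point in the same direction, so Normalizer maps them to the same values, whereas MinMaxScaler keeps them distinct. For the min-max scaling described in this section, MinMaxScaler is the class that matches the formula.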

Step 5: Visualization

Finally, we can visualize the data
