Data Scaling in Python
Have you ever wondered why data scaling is an essential preprocessing step before applying many machine learning algorithms? These algorithms depend on mathematical transformations and calculations that can be adversely affected when input features sit on very different scales.
In this article, we will explore the importance of data scaling, the techniques used for data scaling, and their implementation using the Cruise Ship dataset.
Importance of Data Scaling
In most machine learning models, the algorithm is sensitive to the scale of the variables in the dataset. For instance, Linear Regression computes a weighted sum of the input variables, so the raw magnitudes of those variables can affect how the model is fit.
Similarly, algorithms like k-NN (k-Nearest Neighbors) and K-Means clustering, where distance calculations between data points are crucial, are significantly affected by differences in feature scales. To overcome these issues and attain better model performance, the data needs to be preprocessed and the variables scaled.
Scaling refers to transforming numerical variables onto a comparable scale. For example, consider two input variables, age and weight, where age lies in the range 0-120 years while weight ranges from 0-300 lbs.
Since the two variables have very different scales, they must be standardized or normalized onto a common scale that best suits the algorithm.
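To see why this matters for distance-based methods, here is a minimal sketch with made-up numbers showing how raw weight values can dominate a Euclidean distance:
import numpy as np
# Two hypothetical people: (age in years, weight in lbs)
a = np.array([25, 150])
b = np.array([30, 250])
# On the raw scale, the distance is dominated by the weight axis
print(np.linalg.norm(a - b))  # ~100.12, almost entirely from the weight difference
# After rescaling each feature to 0-1 (dividing by its maximum: 120 and 300),
# both features contribute on comparable terms
scale = np.array([120, 300])
print(np.linalg.norm(a / scale - b / scale))  # ~0.34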
Techniques for Data Scaling
There are two standard techniques used for data scaling: standardization and normalization.
Let us explore each of them in detail.
Standardization
Standardization is the process of rescaling an input variable so that it has the properties of a standard normal distribution: a mean of 0 and a standard deviation of 1. The formula to calculate the standardized value of any input is:
z = (x - μ) / σ
Here,
- x: Input value
- μ: Mean of the input variable
- σ: Standard deviation of the input variable
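As a quick illustration, here is the formula applied by hand to a small made-up sample, using numpy rather than scikit-learn:
import numpy as np
ages = np.array([20, 30, 40, 50, 60])
# z = (x - mu) / sigma, using the population standard deviation
z = (ages - ages.mean()) / ages.std()
print(z)                  # [-1.414 -0.707  0.     0.707  1.414] (rounded)
print(z.mean(), z.std())  # ~0.0 and ~1.0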
Normalization
Normalization is the process of transforming the input variables to lie within a specific range, usually 0-1. Here, we subtract the minimum value from each input and divide by the range (maximum minus minimum).
The formula to calculate the normalized value for any input variable is:
x' = (x - min) / (max - min)
Here,
- x: Input value
- min: Minimum value of the variable
- max: Maximum value of the variable
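Applied by hand to the same made-up sample:
import numpy as np
ages = np.array([20, 30, 40, 50, 60])
# x' = (x - min) / (max - min)
x_norm = (ages - ages.min()) / (ages.max() - ages.min())
print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]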
Using the Cruise Ship Dataset for Scaling Techniques
To better understand these scaling techniques, let us consider the Cruise Ship dataset. This dataset contains information about various cruise ships, such as their age, tonnage, passenger count, length, number of cabins, passenger density, and crew size.
We can apply both the standardization and normalization techniques to transform its variables.
Basic Stats of the Data
Before implementing the scaling techniques, let us take a look at some basic statistical characteristics of the dataset. We can use Python’s pandas library to load the dataset and get a quick summary:
import pandas as pd
# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')
# Shape of the data
print("Data Shape:", df.shape)
# Columns in the data
print("Data Columns:", df.columns)
# Summary Statistics of the data
print("Summary Statistics:n", df.describe())
Output:
Data Shape: (158, 9)
Data Columns: Index(['Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers', 'length',
'cabins', 'passenger_density', 'crew'],
dtype='object')
Summary Statistics:
Age Tonnage passengers length cabins
count 158.000000 158.000000 158.000000 158.000000 158.000000
mean 15.689873 71.284671 18.457405 8.130633 8.830000
std 7.615691 37.229540 9.677095 1.793474 4.471417
min 4.000000 2.329000 0.660000 2.790000 0.330000
25% 10.000000 46.013000 12.535000 7.100000 6.132500
50% 14.000000 71.899000 19.500000 8.555000 9.570000
75% 20.000000 90.772500 24.845000 9.510000 10.885000
max 48.000000 220.000000 54.000000 11.820000 27.000000
passenger_density crew
count 158.000000 158.000000
mean 39.900949 7.794177
std 8.639217 3.503487
min 17.700000 0.590000
25% 34.570000 5.480000
50% 39.085000 8.150000
75% 44.185000 9.990000
max 71.430000 21.000000
From the output, we can see that the dataset contains 158 rows and 9 columns. The columns represent the cruise ship name, line, age, tonnage, passengers, length, cabins, passenger density, and crew.
Additionally, we can see that statistical measures such as the mean, standard deviation, minimum, and maximum have been computed for each numerical column. Now that we have loaded the dataset and generated basic statistics, let's proceed with the scaling techniques.
Standardization in Python
To implement standardization in Python, we can use the StandardScaler class from the sklearn library. The StandardScaler class applies the formula `(x - u) / s` to each input variable, where `u` is the variable's mean and `s` is its standard deviation.
The scaled variables will then have a mean of 0 and a standard deviation of 1.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')
# Select the numerical columns
cols_to_scale = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']
# Scale the variables using StandardScaler
scaler = StandardScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
# Display the scaled data
print(df.head(10))
Output:
Ship_name Cruise_line Age Tonnage passengers length cabins
0 Journey Azamara -0.822261 -1.104984 -1.194924 -1.225046 -1.185325
1 Quest Azamara -1.554188 -0.647310 -0.771357 -0.509447 -0.693891
2 Celebration Carnival 0.649499 -0.558930 -0.487407 -0.099617 -0.364679
3 Conquest Carnival -0.186633 1.108718 1.326205 1.049661 1.231401
4 Destiny Carnival 1.337615 0.137429 0.848723 0.654077 0.676370
5 Ecstasy Carnival -1.005436 -0.817070 -0.702163 -0.562634 -0.891585
6 Elation Carnival -0.358123 -0.467793 -0.333580 0.336224 -0.414658
7 Fantasy Carnival 0.189602 -0.790778 -0.787320 -0.509494 -0.732734
8 Fascination Carnival 0.949375 -0.567581 -0.502371 -0.126211 -0.287067
9 Freedom Carnival -0.186633 0.079048 0.260232 0.307157 0.355156
passenger_density crew
0 -0.662084 0.960
1 -0.465121 0.840
2 -0.011950 0.540
3 -1.676913 6.160
4 0.534998 3.550
5 -0.678947 0.740
6 -0.391652 0.950
7 -0.464286 0.790
8 -0.717636 1.150
9 0.019544 0.950
In the output, we can see that the selected columns have been standardized using the StandardScaler. As expected, each column has a mean value close to 0 and a standard deviation of 1.
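We can verify this directly with a quick sanity check on the scaled columns. StandardScaler divides by the population standard deviation, so we pass ddof=0 to match:
# Means should be ~0 and standard deviations ~1 for the scaled columns
print(df[cols_to_scale].mean().round(6))
print(df[cols_to_scale].std(ddof=0).round(6))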
Normalization in Python
To implement normalization in Python, we can use the MinMaxScaler class from the sklearn library. The MinMaxScaler class applies the formula `(x - min) / (max - min)` to each input variable, where `min` and `max` are the minimum and maximum values of that variable.
The scaled variables will then have values between 0 and 1.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')
# Select the numerical columns
cols_to_scale = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']
# Scale the variables using MinMaxScaler
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
# Display the scaled data
print(df.head(10))
Output:
Ship_name Cruise_line Age Tonnage passengers length cabins
0 Journey Azamara 0.148936 0.064452 0.035595 0.246959 0.037037
1 Quest Azamara 0.106383 0.079999 0.047090 0.320158 0.055556
2 Celebration Carnival 0.382979 0.081933 0.055479 0.486058 0.075926
3 Conquest Carnival 0.000000 0.318020 0.247706 0.840678 0.363426
4 Destiny Carnival 0.510638 0.155306 0.154206 0.676329 0.198413
5 Ecstasy Carnival 0.138298 0.072029 0.050797 0.289862 0.048148
6 Elation Carnival 0.063830 0.084021 0.064231 0.553114 0.115741
7 Fantasy Carnival 0.297872 0.076239 0.046237 0.320202 0.062963
8 Fascination Carnival 0.425532 0.080946 0.052918 0.472064 0.092593
9 Freedom Carnival 0.404255 0.153774 0.100331 0.545478 0.135802
passenger_density crew
0 0.142857 0.960
1 0.169399 0.840
2 0.393204 0.540
3 0.007888 6.160
4 0.619710 3.550
5 0.139035 0.740
6 0.241860 0.950
7 0.168947 0.790
8 0.093310 1.150
9 0.306432 0.950
In the output, we can see that the selected columns have been normalized using the MinMaxScaler. As expected, each of the scaled columns now lies between 0 and 1 (the crew column was not included in cols_to_scale, so it retains its original values).
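Again, a quick sanity check confirms the expected range of the scaled columns:
# Every scaled column should span exactly [0, 1]
print(df[cols_to_scale].min())  # all 0.0
print(df[cols_to_scale].max())  # all 1.0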
Conclusion
To conclude, scaling the variables is an essential step in the preprocessing of data for machine learning models. Variables with different scales can significantly impact the model performance.
In this article, we explored the two common scaling techniques: standardization and normalization. We also applied both techniques to the Cruise Ship dataset, and the results showed that each correctly transformed the variables onto a compatible scale.
By choosing the appropriate scaling technique that suits the machine learning algorithm, we can achieve better model accuracy and performance.
Python Data Scaling – Standardization and Normalization
In machine learning, data preparation is a crucial step in building an accurate and efficient model. One of the most important preprocessing steps is data scaling, which involves transforming the numerical variables into a compatible scale.
Scaling is necessary because many machine learning algorithms are sensitive to the scale of the variables. It is particularly important when the variables have different scales and the algorithm's computations depend directly on their magnitudes.
In this article, we will discuss the two most common techniques used in data scaling: standardization and normalization.
Standardization
In data standardization, the variables are transformed to have a mean of zero and a standard deviation of one. The goal of this transformation is to make the variables comparable and to allow gradient-based algorithms to converge faster.
The standardization of data is achieved by subtracting the mean of the variable from each observation and then dividing by the standard deviation of the variable. The resulting variable will have a mean of 0 and a standard deviation of 1.
This process is also known as z-score normalization.
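Equivalently, the z-score can be computed by hand in pandas without scikit-learn; a minimal sketch, assuming a numeric column named 'column1' in a hypothetical dataset.csv:
import pandas as pd
data = pd.read_csv('dataset.csv')
# Subtract the mean and divide by the standard deviation
# (pandas uses the sample standard deviation by default)
data['column1_std'] = (data['column1'] - data['column1'].mean()) / data['column1'].std()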
Standardization in Python
Python provides several libraries for standardization, such as numpy, pandas, and sklearn. Scikit-learn is a popular library used in machine learning that provides various preprocessing functions to transform data.
Here are the steps for standardization:
Step 1: Importing Libraries
To standardize the data, we first import the necessary libraries: numpy, pandas, and sklearn.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
Step 2: Load the Data
Next, we need to load the data into a pandas dataframe.
data = pd.read_csv('dataset.csv')
Step 3: Select the Columns to Scale
Select the columns that need to be standardized.
cols_to_scale = ['column1', 'column2', 'column3']
Step 4: Scale the Data
Instantiate a StandardScaler object, fit it to the selected numerical columns, and transform them; the fit_transform method performs both steps in one call.
scaler = StandardScaler()
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
Step 5: Visualization
Finally, we can visualize the effect of standardization with a KDE plot. The seaborn library provides distribution-plotting functions such as kdeplot that let us compare the original and standardized distributions. Since `data` was overwritten in the previous step, we reload an unscaled copy for the comparison.
import seaborn as sns
import matplotlib.pyplot as plt
original = pd.read_csv('dataset.csv')  # unscaled copy for comparison
sns.kdeplot(original['column1'], label='Original')
sns.kdeplot(data['column1'], label='Standardized')
plt.legend(); plt.show()
Normalization
The normalization of data involves scaling the data in a way that the values are mapped to a specified range. Most commonly in normalization, the values are scaled to the range between 0 and 1.
This type of normalization is often called min-max scaling.
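Written out by hand in pandas, min-max scaling of one column looks like this (again assuming the hypothetical 'column1'):
import pandas as pd
data = pd.read_csv('dataset.csv')
# Map the column onto [0, 1]: subtract the min, divide by the range
col = data['column1']
data['column1_norm'] = (col - col.min()) / (col.max() - col.min())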
Normalization in Python
Min-max normalization can be done in Python using the MinMaxScaler class from the sklearn library, which rescales each column independently to a chosen range (0 to 1 by default). Note that sklearn also has a Normalizer class, but it rescales each row to unit norm, which is a different operation from min-max scaling.
In contrast to standardization, min-max normalization makes no assumption about the distribution of the data, and the target range can be chosen to suit the application. Here are the steps for normalization:
Step 1: Importing Libraries
To normalize the data, we need to import the necessary libraries like numpy, pandas, and sklearn.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
Step 2: Load the Data
Next, we need to load the data into a pandas dataframe.
data = pd.read_csv('dataset.csv')
Step 3: Select the Columns to Scale
Select the columns that need to be normalized.
cols_to_scale = ['column1', 'column2', 'column3']
Step 4: Scale the Data
Instantiate a MinMaxScaler object, fit it to the selected numerical columns, and transform them; fit_transform performs both steps in one call.
scaler = MinMaxScaler()
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
Step 5: Visualization
Finally, we can visualize the data with a KDE plot, just as we did for standardization, to confirm that the scaled values fall within the 0 to 1 range.
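A minimal sketch, mirroring the standardization plot and reusing the same hypothetical column name:
import seaborn as sns
import matplotlib.pyplot as plt
original = pd.read_csv('dataset.csv')  # unscaled copy for comparison
sns.kdeplot(original['column1'], label='Original')
sns.kdeplot(data['column1'], label='Normalized')
plt.legend(); plt.show()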