
Mastering Data Scaling in Python: Techniques and Best Practices

Data Scaling in Python

Have you ever wondered why data scaling is an essential preprocessing step before applying any machine learning algorithm? Machine learning algorithms depend on mathematical transformations and calculations, which can be adversely affected when input variables vary widely in scale.

In this article, we will explore the importance of data scaling, the techniques used for data scaling, and their implementation using the Cruise Ship dataset.

Importance of Data Scaling

In most machine learning models, the algorithms are sensitive to the scale of the variables in the dataset. For instance, Linear Regression computes a weighted sum of the input variables, so the magnitude of each input directly influences the fitted line.

Similarly, algorithms like k-NN (k-Nearest Neighbors) and K-Means clustering, where the distance calculation between the data points is crucial, will be significantly impacted by the variations in the data scales. To overcome these issues and attain better model performance, the data needs to be preprocessed, and the variables must be scaled.

Scaling refers to transforming numerical variables onto a common, comparable scale. For example, consider two input variables, age and weight, where age lies in the range 0-120 while weight ranges from 0-300 lbs.

Since the two variables have very different scales, they must be standardized or normalized so that each contributes on an equal footing to the algorithm.
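To make this concrete, here is a minimal sketch (using hypothetical age and weight values) of how an unscaled variable can dominate the kind of Euclidean distance calculation that k-NN relies on:

```
import numpy as np

# Two hypothetical people: (age in years, weight in lbs)
a = np.array([25, 150.0])
b = np.array([30, 290.0])

# Unscaled: the weight difference (140) dwarfs the age difference (5)
print(np.linalg.norm(a - b))  # ~140.09, driven almost entirely by weight

# After min-max scaling each feature to [0, 1] using the ranges above
# (age: 0-120, weight: 0-300), the two features contribute comparably
a_scaled = np.array([25 / 120, 150 / 300])
b_scaled = np.array([30 / 120, 290 / 300])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.47
```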

Techniques for Data Scaling

There are two standard techniques used for data scaling: Standardization and Normalization. Let us explore each of them in detail.

Standardization

Standardization is the process of transforming the input variables to meet the properties of Standard Normal Distribution, where the mean value of the distribution is 0, and the standard deviation is 1. The formula to calculate the standardized value for any input variable is:

z = (x – μ) / σ

Here,

x: Input variable

μ: Mean of the inputs

σ: Standard deviation of the inputs
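As a quick illustration, here is a minimal, hand-rolled sketch of the formula using numpy (the sample values are hypothetical):

```
import numpy as np

x = np.array([4.0, 10.0, 14.0, 20.0, 48.0])  # hypothetical input values

mu = x.mean()       # mean of the inputs
sigma = x.std()     # standard deviation of the inputs

z = (x - mu) / sigma
print(z.mean(), z.std())  # ~0.0 and 1.0, as expected
```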

Normalization

Normalization is the process of transforming the input variables to lie within a specific range, usually 0-1. Here, we subtract the minimum value from each input and divide by the range (the difference between the maximum and minimum values).

The formula to calculate the normalized value for any input variable is:

x' = (x – min) / (max – min)

Here,

x: Input Variable

min: Minimum value of the input variable

max: Maximum value of the input variable
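Again as a quick illustration, here is a minimal hand-rolled sketch of min-max scaling with numpy (same hypothetical values as before):

```
import numpy as np

x = np.array([4.0, 10.0, 14.0, 20.0, 48.0])  # hypothetical input values

x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)                      # all values mapped into [0, 1]
print(x_norm.min(), x_norm.max())  # 0.0 and 1.0
```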

Using the Cruise Ship Dataset for Scaling Techniques

To better understand the techniques used in scaling, let us consider the Cruise Ship dataset. This dataset contains information about various cruise ships, such as their age, tonnage, passenger capacity, length, number of cabins, passenger density, and crew size.

We can apply both Standardization and Normalization to transform its variables.

Basic Stats of the Data

Before implementing the scaling techniques, let us take a look at some basic statistical characteristics of the dataset. We can use Python’s pandas library to load the dataset and get a quick summary:

```
import pandas as pd

# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')

# Shape of the data
print("Data Shape:", df.shape)

# Columns in the data
print("\nData Columns:", df.columns)

# Summary statistics of the numerical columns
print("\nSummary Statistics:\n", df.describe())
```

Output:

```
Data Shape: (158, 9)

Data Columns: Index(['Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers', 'length',
       'cabins', 'passenger_density', 'crew'],
      dtype='object')

Summary Statistics:
               Age     Tonnage  passengers      length      cabins
count  158.000000  158.000000  158.000000  158.000000  158.000000
mean    15.689873   71.284671   18.457405    8.130633    8.830000
std      7.615691   37.229540    9.677095    1.793474    4.471417
min      4.000000    2.329000    0.660000    2.790000    0.330000
25%     10.000000   46.013000   12.535000    7.100000    6.132500
50%     14.000000   71.899000   19.500000    8.555000    9.570000
75%     20.000000   90.772500   24.845000    9.510000   10.885000
max     48.000000  220.000000   54.000000   11.820000   27.000000

       passenger_density        crew
count         158.000000  158.000000
mean           39.900949    7.794177
std             8.639217    3.503487
min            17.700000    0.590000
25%            34.570000    5.480000
50%            39.085000    8.150000
75%            44.185000    9.990000
max            71.430000   21.000000
```

From the output, we can see that the dataset contains 158 rows and 9 columns. The columns represent the cruise ship name, line, age, tonnage, passengers, length, cabins, passenger density, and crew.

Additionally, statistical measures like the mean, standard deviation, minimum, and maximum have been computed for each of the numerical columns. Now that we have loaded the dataset and generated basic statistics, let's proceed with the scaling techniques.

Standardization in Python

To implement Standardization in Python, we can use the StandardScaler class from the sklearn library. The StandardScaler class applies the formula `(x – u) / s` to each input variable, where `u` is the mean of the input variable and `s` is its standard deviation.

The scaled variables will then have a mean value of 0 and a standard deviation of 1.

```
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')

# Select the numerical columns to scale
cols_to_scale = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']

# Scale the variables using StandardScaler
scaler = StandardScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Display the scaled data
print(df.head(10))
```

Output:

```
     Ship_name Cruise_line       Age   Tonnage  passengers    length    cabins
0      Journey     Azamara -0.822261 -1.104984   -1.194924 -1.225046 -1.185325
1        Quest     Azamara -1.554188 -0.647310   -0.771357 -0.509447 -0.693891
2  Celebration    Carnival  0.649499 -0.558930   -0.487407 -0.099617 -0.364679
3     Conquest    Carnival -0.186633  1.108718    1.326205  1.049661  1.231401
4      Destiny    Carnival  1.337615  0.137429    0.848723  0.654077  0.676370
5      Ecstasy    Carnival -1.005436 -0.817070   -0.702163 -0.562634 -0.891585
6      Elation    Carnival -0.358123 -0.467793   -0.333580  0.336224 -0.414658
7      Fantasy    Carnival  0.189602 -0.790778   -0.787320 -0.509494 -0.732734
8  Fascination    Carnival  0.949375 -0.567581   -0.502371 -0.126211 -0.287067
9      Freedom    Carnival -0.186633  0.079048    0.260232  0.307157  0.355156

   passenger_density   crew
0          -0.662084  0.960
1          -0.465121  0.840
2          -0.011950  0.540
3          -1.676913  6.160
4           0.534998  3.550
5          -0.678947  0.740
6          -0.391652  0.950
7          -0.464286  0.790
8          -0.717636  1.150
9           0.019544  0.950
```

In the output, we can see that the selected columns have been standardized using the StandardScaler. As expected, each column has a mean value close to 0 and a standard deviation of 1.
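As a quick sanity check (a minimal sketch, run after the code above), we can print the column means and standard deviations. Note that pandas computes the sample standard deviation, so the values will be close to, but not exactly, 1:

```
# Means should be ~0 and standard deviations ~1 for every scaled column
print(df[cols_to_scale].mean().round(6))
print(df[cols_to_scale].std().round(6))
```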

Normalization in Python

To implement Normalization in Python, we can use the MinMaxScaler class from the sklearn library. The MinMaxScaler class applies the formula `(x – min) / (max – min)` to each input variable, where `min` and `max` are the minimum and maximum values of that variable, respectively.

The scaled variables will then have values between 0 and 1.

```
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the dataset
df = pd.read_csv('cruise_ship_info.csv')

# Select the numerical columns to scale
cols_to_scale = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']

# Scale the variables using MinMaxScaler
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Display the scaled data
print(df.head(10))
```

Output:

```
     Ship_name Cruise_line       Age   Tonnage  passengers    length    cabins
0      Journey     Azamara  0.148936  0.064452    0.035595  0.246959  0.037037
1        Quest     Azamara  0.106383  0.079999    0.047090  0.320158  0.055556
2  Celebration    Carnival  0.382979  0.081933    0.055479  0.486058  0.075926
3     Conquest    Carnival  0.000000  0.318020    0.247706  0.840678  0.363426
4      Destiny    Carnival  0.510638  0.155306    0.154206  0.676329  0.198413
5      Ecstasy    Carnival  0.138298  0.072029    0.050797  0.289862  0.048148
6      Elation    Carnival  0.063830  0.084021    0.064231  0.553114  0.115741
7      Fantasy    Carnival  0.297872  0.076239    0.046237  0.320202  0.062963
8  Fascination    Carnival  0.425532  0.080946    0.052918  0.472064  0.092593
9      Freedom    Carnival  0.404255  0.153774    0.100331  0.545478  0.135802

   passenger_density   crew
0           0.142857  0.960
1           0.169399  0.840
2           0.393204  0.540
3           0.007888  6.160
4           0.619710  3.550
5           0.139035  0.740
6           0.241860  0.950
7           0.168947  0.790
8           0.093310  1.150
9           0.306432  0.950
```

In the output, we can see that the selected columns have been normalized using the MinMaxScaler. As expected, each column has a value between 0 and 1.
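As with standardization, a quick sanity check (a minimal sketch, run after the code above) confirms the range of every scaled column:

```
# Minimums should be 0.0 and maximums 1.0 for every scaled column
print(df[cols_to_scale].min())
print(df[cols_to_scale].max())
```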

Conclusion

To conclude, scaling the variables is an essential step in the preprocessing of data for machine learning models. Variables with different scales can significantly impact the model performance.

In this article, we explored the two common scaling techniques: Standardization and Normalization. We also applied these techniques to the Cruise Ship dataset, and the results showed that both correctly transformed the variables onto a compatible scale.

By choosing the scaling technique that suits the machine learning algorithm, we can achieve better model accuracy and performance.

Python Data Scaling – Standardization and Normalization

In machine learning, data preparation is a crucial step in building an accurate and efficient model. One of the most important preprocessing steps is data scaling, which involves transforming the numerical variables into a compatible scale.

Scaling is necessary because many machine learning algorithms are sensitive to the scale of the variables. It is particularly important when the variables span very different ranges and the algorithm depends on their magnitudes.

In this article, we will discuss the two most common techniques used in data scaling: Standardization and Normalization.

Standardization

In data standardization, the variables are transformed to have zero mean and unit standard deviation. The goal of this transformation is to make the variables comparable and to allow the algorithms to converge faster.

The standardization of data is achieved by subtracting the mean of the variable from each observation and then dividing by the standard deviation of the variable. The resulting variable will have a mean of 0 and a standard deviation of 1.

This process is also known as z-score normalization.

Standardization in Python

Python provides several libraries for standardization, such as numpy, pandas, and sklearn. Scikit-learn is a popular library used in machine learning that provides various preprocessing functions to transform data.

Here are the steps for standardization:

Step 1: Importing Libraries

To standardize the data, we need to import the necessary libraries like numpy, pandas, and sklearn.

```
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
```

Step 2: Load the Data

Next, we need to load the data into a pandas dataframe.

```
data = pd.read_csv('dataset.csv')
```

Step 3: Select the Columns to Scale

Select the columns that need to be standardized.

```
cols_to_scale = ['column1', 'column2', 'column3']
```

Step 4: Scale the Data

Instantiate a StandardScaler object and use its fit_transform method to fit it to the selected columns and transform them in one step. We also keep a copy of the unscaled data so that we can compare distributions in the next step.

```
# Keep a copy of the unscaled data for the comparison plot in Step 5
original = data.copy()

scaler = StandardScaler()
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
```

Step 5: Visualization

Finally, we can visualize the effect of standardization using a KDE (kernel density estimate) plot. The seaborn library provides distribution-plotting functions that let us compare the original and standardized distributions.

```
import seaborn as sns
import matplotlib.pyplot as plt

# Compare the distribution of a column before and after standardization
sns.kdeplot(original['column1'], label='Original')
sns.kdeplot(data['column1'], label='Standardized')
plt.legend()
plt.show()
```

Normalization

The normalization of data involves scaling the data in a way that the values are mapped to a specified range. Most commonly in normalization, the values are scaled to the range between 0 and 1.

This type of normalization is often called min-max scaling.

Normalization in Python

Normalization can be done in Python using the Normalizer class of the sklearn library. The Normalizer class scales the data row-wise, meaning that each row is rescaled independently.

In contrast to standardization, normalization does not depend on the distribution of the data, and the range of the data can be set according to the application. Here are the steps for normalization:
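To make the distinction concrete, here is a minimal sketch (with made-up numbers) showing that Normalizer rescales each row to unit length, unlike the column-wise MinMaxScaler:

```
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Each row is divided by its own L2 norm, so every row ends up with length 1
print(Normalizer().fit_transform(X))
# [[0.6 0.8]
#  [1.  0. ]]
```

With that distinction in mind, here are the steps for normalization: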

Step 1: Importing Libraries

To normalize the data, we need to import the necessary libraries like numpy, pandas, and sklearn.

```
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer
```

Step 2: Load the Data

Next, we need to load the data into a pandas dataframe.

```
data = pd.read_csv('dataset.csv')
```

Step 3: Select the Columns to Scale

Select the columns that need to be normalized.

```
cols_to_scale = ['column1', 'column2', 'column3']
```

Step 4: Scale the Data

Instantiate a Normalizer object and use its fit_transform method to rescale each row of the selected columns.

```
normalizer = Normalizer()
data[cols_to_scale] = normalizer.fit_transform(data[cols_to_scale])
```

Step 5: Visualization

Finally, we can visualize the normalized data in the same way as before, comparing each column's distribution before and after scaling with a KDE plot.
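As a final sketch, mirroring the standardization example above (and assuming you saved an unscaled copy with `original = data.copy()` before scaling, as in Step 4 of the standardization section):

```
import seaborn as sns
import matplotlib.pyplot as plt

# Compare the distribution of a column before and after normalization
sns.kdeplot(original['column1'], label='Original')
sns.kdeplot(data['column1'], label='Normalized')
plt.legend()
plt.show()
```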
