Adventures in Machine Learning

Mastering Machine Learning: Splitting Data with Scikit-learn

Splitting Data into Training and Testing Sets: The Importance and How to Do It

Machine learning is a popular application of artificial intelligence that involves teaching computers to learn from data without being explicitly programmed. It has revolutionized a wide range of industries, from finance and healthcare to marketing and gaming.

Despite its many benefits, machine learning is not without challenges: a model can look accurate during development yet perform poorly on new data. One way to detect this problem is to split the data into training and testing sets.

In this article, we'll explore the importance of this process and how to do it in Python using Scikit-learn, a popular machine learning library.

Part 1: Importance of Splitting Data into Training and Testing Sets

Overfitting and Underfitting

The primary reason for splitting data into training and testing sets is to make overfitting and underfitting visible. Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor generalization to new data.

Underfitting is the opposite, where a model is too simple and cannot capture the complexity of the data, resulting in poor performance on both the training and testing sets. Splitting the data into separate sets allows us to train the model on a subset of the data and test its performance on a different subset, which can help us detect and correct for overfitting and underfitting.
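To make the contrast concrete, here is a minimal sketch that compares training and testing accuracy for a very shallow and an unrestricted decision tree. The synthetic dataset, the choice of DecisionTreeClassifier, and all parameter values are assumptions made for illustration; they are not part of the article's own example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (an assumption for illustration only).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for name, depth in [("shallow", 1), ("unrestricted", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, testing accuracy) for each model.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```

A large gap between training and testing accuracy (typical of the unrestricted tree) points toward overfitting, while low accuracy on both sets (more likely for the shallow tree) points toward underfitting. The exact numbers depend on the data and the seed.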

Testing on Training Data

Another reason for splitting data into training and testing sets is to avoid testing on seen data. Testing on training data can lead to artificially high performance metrics, which may not generalize to new data.

When data is split into a training set and a testing set, the model is trained on the training set and evaluated on the testing set, which simulates its performance on new data.
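The effect of testing on seen data can be illustrated with a 1-nearest-neighbor classifier, which effectively memorizes its training set. This is only a sketch: the synthetic data and the choice of KNeighborsClassifier are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data (an assumption for illustration only).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=1,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# With one neighbor, every training point is its own nearest neighbor.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

seen_accuracy = knn.score(X_train, y_train)    # evaluated on data the model has seen
unseen_accuracy = knn.score(X_test, y_test)    # evaluated on held-out data

print("Accuracy on training data:", seen_accuracy)
print("Accuracy on testing data: ", unseen_accuracy)
```

The training score is perfect by construction, so it says nothing about how the model handles new data. Only the testing score gives a usable estimate.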

Splitting Data Using Sklearn

Scikit-learn is a popular machine learning library in Python that provides a simple way to split data into training and testing sets. The train_test_split function takes a dataset and a split ratio (e.g., 70% for training and 30% for testing) and returns separate subsets for training and testing.
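As a quick sketch of the 70/30 ratio mentioned above, the following splits ten integers. The array of integers is a stand-in dataset chosen for illustration, and random_state=0 is an arbitrary seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten items standing in for a dataset (hypothetical data).
data = np.arange(10)

# test_size=0.3 asks for 30% of the items in the testing subset.
train, test = train_test_split(data, test_size=0.3, random_state=0)

print("Training items:", len(train))
print("Testing items: ", len(test))
```

Every item ends up in exactly one of the two subsets, so the training and testing sets never overlap.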

Part 2: Splitting Data into Training and Testing Sets in Python

Dataset and Pandas Dataframe

To split data into training and testing sets, we'll need a dataset. In Python, we can use Pandas to create a dataframe, which is a two-dimensional table that can hold data of different types.

Here's an example of creating a dataframe:

import pandas as pd

df = pd.DataFrame({
    'Height': [5.8, 5.9, 5.6, 5.7, 6.0, 5.7],
    'Weight': [130, 140, 120, 125, 150, 135],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
})

This creates a dataframe with three columns: Height, Weight, and Gender, and six rows of data.

Input and Output Vectors

Before we split the data into training and testing sets, we need to separate the input and output vectors. The input vector contains the features that are used to predict the output, while the output vector contains the target variable that we want to predict.

In our example, the input vector would be the Height and Weight columns, and the output vector would be the Gender column. Here's how we can do this in Python:

X = df[['Height', 'Weight']]

y = df['Gender']

This creates two variables, X and y, which contain the input and output vectors, respectively.
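The difference between the two selections can be checked directly. The sketch below rebuilds the article's dataframe so that it runs on its own; selecting with a list of column names gives a dataframe, while selecting a single column name gives a Series.

```python
import pandas as pd

# Same toy data as in the article.
df = pd.DataFrame({
    'Height': [5.8, 5.9, 5.6, 5.7, 6.0, 5.7],
    'Weight': [130, 140, 120, 125, 150, 135],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
})

X = df[['Height', 'Weight']]  # list of column names -> DataFrame
y = df['Gender']              # single column name -> Series

print(type(X).__name__, X.shape)  # DataFrame (6, 2)
print(type(y).__name__, y.shape)  # Series (6,)
```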

Splitting Data Using Sklearn

Now that we have the input and output vectors, we can split the data into training and testing sets using Scikit-learn's train_test_split function. Here's an example of how to do this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

This creates four variables: X_train and y_train, which contain the training data, and X_test and y_test, which contain the testing data.

The test_size parameter specifies the proportion of the data to use for testing, and the random_state parameter ensures reproducibility by setting a seed for the random number generator.
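The reproducibility that random_state provides can be verified with a small sketch: two calls with the same seed should select the same rows. The dataframe is rebuilt here so the example is self-contained, and the variable names ending in _a and _b are introduced only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as in the article.
df = pd.DataFrame({
    'Height': [5.8, 5.9, 5.6, 5.7, 6.0, 5.7],
    'Weight': [130, 140, 120, 125, 150, 135],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
})
X = df[['Height', 'Weight']]
y = df['Gender']

# Two independent calls with the same seed.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=42)

# The row labels of the two testing sets should be identical.
same_rows = list(X_test_a.index) == list(X_test_b.index)
print("Same test rows with the same seed:", same_rows)
print("Test row labels:", list(X_test_a.index))
```

Without a fixed random_state, each run shuffles the rows differently, so results generally vary from run to run.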

Comparing Shapes of Different Sets

Finally, before we start building and training models, it's a good idea to compare the shapes of the different sets. The shape attribute reports the number of rows and, for a dataframe, the number of columns.

In our example, we can check the shape of the input and output vectors and the training and testing sets like this:

print('X shape:', X.shape)
print('y shape:', y.shape)
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

This should give us output that looks something like this:

X shape: (6, 2)

y shape: (6,)

X_train shape: (4, 2)

y_train shape: (4,)

X_test shape: (2, 2)

y_test shape: (2,)
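The 4/2 split may look surprising, since 30% of 6 rows is 1.8 rows. As far as we understand scikit-learn's behavior, a fractional test_size is rounded up to a whole number of samples and the remainder goes to the training set. The sketch below checks that arithmetic against an actual split; the array of six integers is a stand-in for the six rows.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 6
test_size = 0.3

# Assumed rule: round the test count up, give the rest to training.
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

train, test = train_test_split(
    np.arange(n_samples), test_size=test_size, random_state=42
)

print("Expected train/test:", expected_train, expected_test)
print("Actual train/test:  ", len(train), len(test))
```

With a dataset this small, the effective split is closer to 67/33 than 70/30. On larger datasets the rounding effect becomes negligible.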


Splitting data into training and testing sets is an essential step in machine learning. It lets us detect overfitting and underfitting and evaluate models on data they have not seen during training. Pandas dataframes and Scikit-learn's train_test_split function make the split straightforward in Python.

The key takeaway is to evaluate a model on data it was not trained on, so that its measured performance reflects how it is likely to behave in real-world use.
