Importance of Splitting Dataset Into Training and Testing Set
Machine learning models have become an increasingly valuable tool in data analysis and decision-making processes across various sectors. Before training a machine learning model, it is essential to split the dataset into two subsets, namely the training set and the testing set.
When building a machine learning model, the primary goal is to create a model that generalizes well and can make accurate predictions on new data.
Overfitting, where the model learns the training set’s noise and outliers, cannot be detected when a model is trained and evaluated on the same data. Such a model can score well on the data it has already seen yet perform poorly on new data, because it has become specialized to the examples it learned.
For this reason, splitting the dataset into the training set and the testing set is a fundamental aspect of machine learning model building. The training set is used to fit the model so that it captures the relevant patterns and relationships in the data, while the testing set is held back and used only to evaluate the model’s performance on unseen data.
Part of the dataset, typically 80%, is used to build the model, while the remaining 20% is used to test it. Alternatively, the dataset is split into three subsets: the training set, the validation set, and the testing set.
The validation set is used to tune the model and to monitor overfitting during training, which leaves the testing set untouched for the final evaluation.
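As a rough illustration of the 80/20 split and the three-way variant, here is a minimal sketch using scikit-learn's train_test_split on the iris dataset. The random_state values, the stratify option, and the 25% validation share are assumptions chosen for the example, not settings prescribed by this article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as the testing set; stratify keeps class ratios similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Optional three-way split: carve a validation set out of the training rows
# (25% of the remaining 80% is 20% of the full dataset).
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0
)

print(len(X_fit), len(X_val), len(X_test))
```

The model would be fitted on X_fit, tuned against X_val, and scored once on X_test at the very end.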
Cross-Validation
Cross-validation is an evaluation technique used to estimate how well a model generalizes to unseen data and to detect models that underfit or overfit the training set.
The cross-validation process divides the data into subsets (folds) of roughly equal size and repeats training and testing so that each fold serves as the test set once. Every observation is therefore used for both training and testing, but never in the same round. There are various types of cross-validation techniques, including the leave-one-out, k-fold, stratified k-fold, and time-series methods.
Leave-one-out cross-validation
Leave-one-out, also called LOOCV, is a cross-validation technique suitable for small datasets. In this method, the machine learning model trains on all observations except one.
The model is then tested on the single held-out observation. The procedure is repeated so that each observation is held out exactly once, and the evaluation score is the average over all repetitions.
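The procedure can be sketched with scikit-learn's LeaveOneOut splitter. This is only an illustration on the iris dataset; raising max_iter to 1000 is an assumption made here to avoid solver convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)  # max_iter is an assumption, see above

# Each split trains on n - 1 observations and tests on the one left out,
# so every per-split score is either 0.0 (wrong) or 1.0 (correct).
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("Number of fits:", len(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

Because the model is refitted once per observation, the cost grows with the size of the dataset, which is why the method is usually reserved for small datasets.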
K-fold cross-validation
K-fold is one of the most widely used cross-validation techniques in machine learning.
The dataset is partitioned into k subsets (folds) of approximately equal size. In each of k rounds, one fold is used for testing the model while the remaining k - 1 folds train it, so each fold serves as the test set exactly once. The k scores are then averaged to assess the model’s generalization performance.
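To make the partitioning concrete, the following sketch runs scikit-learn's KFold on six toy observations with k = 3. The toy array is a made-up example used only to show which row indices land in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(6).reshape(-1, 1)  # six toy observations, row indices 0..5
kf_demo = KFold(n_splits=3)        # no shuffling: folds are consecutive blocks

kfold_splits = [(train.tolist(), test.tolist()) for train, test in kf_demo.split(toy)]
for train, test in kfold_splits:
    print("train:", train, "test:", test)

# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

Each row index appears in exactly one test fold and in the training set of the other two rounds.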
Stratified k-fold cross-validation
Stratified k-fold is a type of k-fold cross-validation suitable for imbalanced datasets. It partitions the dataset into k folds, ensuring that each fold has approximately the same ratio of class labels as the original dataset.
The machine learning model then trains on the training folds and tests on the testing fold, so that every evaluation round reflects the overall class distribution rather than an unrepresentative sample.
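A small sketch of the idea, using scikit-learn's StratifiedKFold on made-up labels with a 2:1 class imbalance (the toy labels and the choice of four folds are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 8 samples of class 0 and 4 samples of class 1.
y_toy = np.array([0] * 8 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=4)

# Count how many samples of each class end up in every test fold.
fold_counts = [
    np.bincount(y_toy[test], minlength=2).tolist()
    for _, test in skf.split(X_toy, y_toy)
]
print(fold_counts)  # [[2, 1], [2, 1], [2, 1], [2, 1]]
```

Every test fold keeps the original 2:1 ratio, which a plain k-fold split over the same ordered labels would not guarantee.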
Time Series Cross-Validation
The time series cross-validation technique is suitable for sequential datasets whose observations are ordered in time.
It partitions the dataset into training and testing sets; however, in each split the training set contains only observations that precede the testing set, with each fold representing a unit of time. The purpose of this technique is to determine how well the model generalizes, given that it learns from past data and predicts future data.
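scikit-learn offers a TimeSeriesSplit class for this kind of split. The sketch below applies it to eight toy observations assumed to be in time order; the array and the number of splits are illustrative choices only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(8).reshape(-1, 1)  # eight toy observations in time order
tscv = TimeSeriesSplit(n_splits=3)

ts_splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(series)]
for train, test in ts_splits:
    # The training window grows over time and always ends before the test window.
    print("train:", train, "test:", test)
```

In every split the largest training index is smaller than the smallest test index, so the model is never trained on data from the future.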
Implementation of K-Fold Cross-Validation Using Scikit-Learn
Scikit-learn is an open-source machine learning framework that contains inbuilt cross-validation functions, including the KFold class, which splits the dataset into k consecutive folds (optionally shuffling the rows first) and yields the training and testing indices for each split. Here we will train and test a logistic regression model on the iris dataset using K-fold cross-validation.
Firstly, we will import the required libraries and load the iris dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
We will then create an instance of the logistic regression model.
model = LogisticRegression()
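We will set the number of folds to 3 and use the KFold object to split the dataset. Because the iris rows are stored in class order, we shuffle them before splitting; otherwise each testing fold would consist of a class the model never saw during training.
kf = KFold(n_splits=3, shuffle=True, random_state=42)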
The next step is to train and test the logistic regression model on each of the K-fold splits.
We will print the accuracy score after each model evaluation.
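for train_index, test_index in kf.split(X):
    # Select the rows for this round's training and testing sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit on the training folds, then score on the held-out fold
    model.fit(X_train, y_train)
    print("Accuracy:", model.score(X_test, y_test))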
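The output of the code will look similar to the following; the exact values depend on how the rows are assigned to folds and on the scikit-learn version: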
Accuracy: 0.98
Accuracy: 0.94
Accuracy: 0.98
Each line reports the accuracy of a logistic regression model that was trained on two of the folds and tested on the remaining one.
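One detail worth checking with KFold on this dataset: it does not shuffle unless shuffle=True is passed, and the iris rows are stored in class order. The sketch below compares which classes appear in each testing fold with and without shuffling. The helper name classes_per_test_fold and the seed 42 are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

labels = load_iris().target  # stored in class order: 50 zeros, 50 ones, 50 twos

def classes_per_test_fold(splitter):
    """Return the sorted class labels present in each testing fold (hypothetical helper)."""
    rows = labels.reshape(-1, 1)
    return [sorted(set(labels[test].tolist())) for _, test in splitter.split(rows)]

unshuffled = classes_per_test_fold(KFold(n_splits=3))
shuffled = classes_per_test_fold(KFold(n_splits=3, shuffle=True, random_state=42))

print("unshuffled:", unshuffled)  # [[0], [1], [2]]
print("shuffled:  ", shuffled)
```

Without shuffling, each testing fold holds a single class that is absent from the corresponding training folds, so the reported accuracy would say little about the model. Shuffling, or a stratified splitter, avoids this.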
Conclusion:
Splitting the dataset into two subsets, the training set and the testing set, is fundamental in the machine learning model building process. The cross-validation process helps to evaluate the model’s generalization performance and to detect underfitting and overfitting.
Various cross-validation techniques, including the k-fold, leave-one-out, stratified k-fold, and time series methods, offer a wide range of frameworks to choose from and account for various datasets’ nuances. Furthermore, the Scikit-learn machine learning framework’s collection of inbuilt cross-validation functions, such as the KFold object, simplifies the implementation of cross-validation techniques and the evaluation of models’ accuracy scores.
Cross Validation using cross_val_score()
Cross-validation techniques, such as k-fold, leave-one-out, stratified k-fold, and time-series, are critical in training and testing machine learning models to check their generalizability. The scikit-learn Python library provides a cross_val_score() function that automates the process of cross-validating the model and returns an array with the model’s score for each iteration.
This function can be used to test various models on datasets and assess their generalizability.
Overview of cross_val_score() method
The cross_val_score() function in the scikit-learn Python library is used to evaluate a machine learning model’s performance by cross-validation.
It automates the process of splitting a dataset into folds and training and testing a fresh copy of the model on each split. Its main arguments are the model, the dataset, the labels, and cv, which is either the number of folds or a splitter object. When cv is an integer and the model is a classifier, the folds are stratified by class.
The function returns an array with one score per fold. By default the score comes from the model’s own score method, which is accuracy for classifiers; the scoring argument selects a different metric.
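As a hedged illustration of the cv and scoring arguments, the sketch below passes an explicit StratifiedKFold splitter and asks for the macro-averaged F1 score instead of accuracy. The splitter settings, the metric, and max_iter=1000 are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)  # max_iter raised to avoid convergence warnings

# cv accepts a splitter object as well as an integer number of folds.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring selects the metric; "f1_macro" averages the per-class F1 scores.
f1_scores = cross_val_score(model, X, y, cv=splitter, scoring="f1_macro")

print("Scores per fold:", f1_scores)
print("Mean macro F1:", f1_scores.mean())
```

Passing a splitter object makes the shuffling and the random seed explicit, which helps when results need to be reproduced.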
Implementation of cross_val_score using scikit learn
Here, we will implement the cross_val_score procedure using the logistic regression model on the iris dataset.
Firstly, we import the necessary libraries.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
Then, we load the iris dataset, create an instance of the logistic regression model, and set the number of folds to 5.
iris = load_iris()
X = iris.data
y = iris.target
model = LogisticRegression()
cv = 5
To determine the accuracy score of the model on the dataset using cross-validation, we pass the model, the data, the labels, and the number of folds to cross_val_score.
scores = cross_val_score(model, X, y, cv=cv)
The cross_val_score function returns an array with one accuracy score for each of the five folds.
We can print the array of scores and their mean to summarize the model’s performance.
print('Accuracy Scores:', scores)
print('Mean Accuracy:', np.mean(scores))
The output of the code will appear similar to the following; exact values can vary with the scikit-learn version and solver settings:
Accuracy Scores: [0.96666667 1.         0.96666667 0.96666667 1.        ]
Mean Accuracy: 0.9800000000000001
Thus, in this run the mean accuracy of the logistic regression model on the iris dataset using five-fold cross-validation is 98%.
Practical Implications:
The downside of cross-validation is that it is computationally expensive, because the model must be trained once per fold. The cost is most noticeable with large datasets or complex machine learning models. It is essential to consider this when deciding which cross-validation technique to use.
Increasing the number of folds or repeating the procedure generally gives a more stable estimate of performance at a higher computational cost, so the choice of technique is a trade-off between reliability and available resources. Even so, cross-validation remains an essential procedure in the machine learning model evaluation process.
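A simple way to gauge the cost is to count how many times the model would be fitted under each scheme. The sketch below uses the get_n_splits method of scikit-learn's splitters on the iris dataset; the three schemes compared are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut

X, _ = load_iris(return_X_y=True)  # 150 observations

# Each split means one full training run of the model.
fits_required = {
    "5-fold": KFold(n_splits=5).get_n_splits(X),
    "10-fold": KFold(n_splits=10).get_n_splits(X),
    "leave-one-out": LeaveOneOut().get_n_splits(X),
}
print(fits_required)  # {'5-fold': 5, '10-fold': 10, 'leave-one-out': 150}
```

On a dataset this small the difference is negligible, but for a model that takes minutes to train, leave-one-out on thousands of rows quickly becomes impractical.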
Conclusion:
In conclusion, cross-validation techniques, such as k-fold and leave-one-out, are critical in assessing the generalizability of machine learning models.
The cross_val_score() function in the scikit-learn Python library provides an efficient way of automating model evaluation using cross-validation techniques. Using this function, we assessed the accuracy of the logistic regression model on the iris dataset with five-fold cross-validation, which yielded an average accuracy score of 98%.
It is important to weigh the computational resources that cross-validation requires when choosing which technique to use. In summary, cross-validation is a critical process for obtaining a reliable estimate of a machine learning model’s generalizability and accuracy.
Splitting a dataset into a training and testing set is essential in building machine learning models effectively. Cross-validation techniques like k-fold, leave-one-out, stratified k-fold, and time-series methods offer a range of frameworks to choose from and account for different nuances that datasets may present.
Using cross-validation can be computationally intensive when working with larger datasets or complex models. The scikit-learn library’s cross_val_score function automates the cross-validation process and returns the model’s score for each iteration.
Overall, cross-validation techniques play a crucial role in the accurate and effective evaluation of machine learning models, and should be used whenever feasible to obtain a reliable estimate of their generalizability and performance.