Detecting Parkinson’s Disease with Machine Learning Using Python

Parkinson’s Disease is a progressive neurological disorder that affects millions of people worldwide. It is characterized by tremors, rigidity, and impaired balance and coordination.

While there is no known cure for Parkinson’s Disease, early diagnosis and treatment can help manage the symptoms and slow down its progression. Thanks to advancements in technology, medical researchers can now utilize machine learning models to predict and diagnose Parkinson’s Disease.

In this article, we will discuss how to develop an ML model for Parkinson’s Disease detection using Python and a relevant dataset.

1) Importing Necessary Libraries

To start, we need to import the necessary libraries that we will be using throughout the project.

  • NumPy is used for numerical computations.
  • pandas is used for data manipulation and analysis.
  • Matplotlib is used to plot and visualize data.
  • scikit-learn is a popular Python library used for data preprocessing, modeling, and evaluation.
  • XGBoost is used for gradient boosting.

The following code imports the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

2) Understanding the Parkinson’s Disease Dataset

The dataset we will be using contains biomedical voice measurements taken from people with and without Parkinson’s Disease. Its features include the average, maximum, and minimum vocal fundamental frequency (MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz)), several measures of frequency variation (jitter) and amplitude variation (shimmer), noise-to-harmonics ratios (NHR, HNR), and nonlinear signal measures such as RPDE, DFA, spread1, spread2, and D2.

After importing and cleaning the data, we can start building our ML model. Models for Parkinson’s Disease generally fall into two types: regression models and classification models.

  • In regression models, we predict a continuous outcome, such as tremor severity.
  • In classification models, we predict a binary outcome, such as “Parkinson’s Disease” or “healthy.”

We can use various regression models such as Linear Regression, Decision Tree Regressor, or Random Forest Regressor to predict continuous outcomes. On the other hand, we can use classification models such as Logistic Regression, Decision Tree Classifier, or Random Forest Classifier to predict binary outcomes.
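For the classification case, all of these scikit-learn models share the same fit/predict interface, so they are easy to compare. The following is a minimal comparison sketch (illustrative only, not part of the main workflow); it assumes the X_train, X_test, y_train, and y_test splits created later in this article:

# Illustrative comparison of the classification models mentioned above.
# Assumes X_train, X_test, y_train, y_test from the train-test split step below.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

candidate_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in candidate_models.items():
    model.fit(X_train, y_train)            # train on the training split
    predictions = model.predict(X_test)    # predict on the held-out split
    print(name, "accuracy:", accuracy_score(y_test, predictions))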

In our case, we will use the XGBoost classifier (XGBClassifier), an efficient implementation of gradient boosting that generally performs well on classification tasks like this one.

3) Loading the Dataset

The first step in developing our ML model for Parkinson’s Disease is to load the dataset. The dataset we will be using contains various features and labels to help us predict Parkinson’s Disease.

We can use the pandas module to import the dataset into Python. In this case, the dataset is stored in a CSV file named “parkinsons.csv.” To import the dataset, we can use the following code:

dataset = pd.read_csv('parkinsons.csv')

Once we have loaded the dataset, we can inspect its structure and make sure it is in the correct format. The dataset contains a “name” column that identifies each voice recording, followed by the voice-measurement features described earlier, such as fundamental frequency, jitter, shimmer, and noise ratios.

It also contains a label column called “status,” which indicates whether the patient has Parkinson’s Disease or not. The “status” column contains binary values, where “0” indicates a healthy patient, while “1” indicates a patient with Parkinson’s Disease.

This label column is crucial for training and evaluating our ML model accurately.
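Before moving on, a few quick pandas checks (optional) confirm that the file loaded correctly and show how the “status” labels are distributed:

# Optional sanity checks on the loaded dataset
print(dataset.head())                      # first few rows and column names
print(dataset.shape)                       # number of rows and columns
print(dataset['status'].value_counts())    # healthy (0) vs Parkinson's (1) counts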

4) Normalizing the Data

After loading the dataset, we need to prepare the data for our ML model by normalizing it. Normalization is essential as it ensures that all variables are in the same range and prevents any one variable from having a disproportionate influence on the model.

One common technique for normalization is using the MinMaxScaler, which scales the data to a range between 0 and 1. We can use the following code to scale our data:

# Drop the 'name' identifier, the 'status' label, and the excluded jitter columns
feature_df = dataset.drop(['name', 'MDVP:Jitter(%)', 'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP', 'status'], axis=1)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_dataset = scaler.fit_transform(feature_df)
# Reuse feature_df's column names so the DataFrame matches the scaled array's shape
scaled_df = pd.DataFrame(scaled_dataset, columns=feature_df.columns)

In the code above, we first separate the feature columns by dropping the ‘name’ identifier, the ‘status’ label, and the jitter-related columns (‘MDVP:Jitter(%),’ ‘MDVP:Jitter(Abs),’ ‘MDVP:RAP,’ ‘MDVP:PPQ,’ and ‘Jitter:DDP’) that we exclude from this feature set.

Next, we instantiate the MinMaxScaler (imported earlier from sklearn.preprocessing), specifying that we want to scale our features to a range between 0 and 1, and fit it to the feature columns.

Finally, we store the normalized values in a new pandas DataFrame, reusing the feature column names. It’s important to note that we do not need to scale the label column, as it only contains binary values.
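MinMaxScaler rescales each feature column independently using (x - min) / (max - min). A quick optional check confirms that the scaled features now lie between 0 and 1:

# Each scaled feature should now fall within the [0, 1] range
print(scaled_df.min().min(), scaled_df.max().max())   # expected: 0.0 1.0
print(scaled_df.describe())                           # per-column summary statistics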

5) Train-Test Split of the Data

After normalizing our dataset, the next step in developing our ML model for Parkinson’s Disease is to split the data into training and testing sets. This will help us evaluate the performance of our model accurately and prevent overfitting.

We can use the train_test_split function from the sklearn.model_selection module to split our dataset into training and testing sets. This function shuffles the data and splits it according to the specified test size.

In this case, we will use an 80/20 split, where 80% of the data is used for training and the remaining 20% for testing. We can use the following code to split our data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(scaled_df, dataset['status'], test_size=0.2, random_state=42)

The code above splits the data into training and testing sets, with the training set comprising 80% of the data and the testing set the remaining 20%.

We also set the random state to 42, ensuring that the same random split is generated every time the code runs.
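A quick shape check (optional) confirms the 80/20 split. As an optional refinement that is not used in the rest of this article, the stratify argument keeps the proportion of healthy and Parkinson’s records similar in both sets:

# Verify the 80/20 split
print(X_train.shape, X_test.shape)

# Optional variant (not used below): a stratified split preserves the class balance
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    scaled_df, dataset['status'], test_size=0.2, random_state=42,
    stratify=dataset['status'])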

6) Initializing and Training the XGBClassifier

With our training and testing sets prepared, we can now initialize the XGBClassifier and train our model. The XGBClassifier is a fast, efficient implementation of the gradient boosting algorithm and generally achieves high accuracy on tabular datasets like this one.

We can use the following code to initialize the XGBClassifier and train our model:

xgb_classifier = XGBClassifier()
xgb_classifier.fit(X_train, y_train)

The code above initializes the XGBClassifier object (imported earlier from the xgboost module) with its default settings. We then fit the model on the X_train features and y_train labels, training the classifier.

After training the model, we can evaluate its performance using various metrics such as accuracy, precision, recall, and F1 score. We can use the X_test and y_test data to evaluate the performance of our model.
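For example, scikit-learn’s classification_report summarizes precision, recall, and F1 score per class. The following sketch is optional; the next section walks through predictions and accuracy step by step:

from sklearn.metrics import classification_report

# Predict on the held-out test set and report per-class precision, recall, and F1
test_predictions = xgb_classifier.predict(X_test)
print(classification_report(y_test, test_predictions))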

We can also use various techniques such as cross-validation to improve our model’s accuracy.
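A minimal cross-validation sketch is shown below, reusing the unscaled feature table feature_df from the normalization step. Wrapping the scaler and classifier in a pipeline re-fits the scaler on each training fold, so no information from the validation fold leaks into the scaling:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 5-fold cross-validation over the full dataset
cv_model = make_pipeline(MinMaxScaler(feature_range=(0, 1)), XGBClassifier())
cv_scores = cross_val_score(cv_model, feature_df, dataset['status'], cv=5)
print("Cross-validation accuracy: mean =", cv_scores.mean(), "std =", cv_scores.std())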

7) Getting Predictions and Accuracy

Once we have trained our XGBClassifier model, we can use it to make predictions on our testing dataset and evaluate its accuracy. To obtain predictions, we can use the predict() function of our XGBClassifier object.

We can use the following code to obtain predictions and calculate our model’s accuracy:

y_pred = xgb_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

The code above uses our xgb_classifier object to predict the labels for the X_test data and stores them in a new y_pred variable.

Finally, we calculate our model’s accuracy by comparing the actual labels in y_test with the predicted labels in y_pred, using the accuracy_score function.
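If the classes are imbalanced, accuracy alone can be misleading, so it is worth looking at the confusion matrix as well (an optional check):

from sklearn.metrics import confusion_matrix

# Rows are the true labels (0 = healthy, 1 = Parkinson's), columns are the predictions
print(confusion_matrix(y_test, y_pred))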

8) Conclusion

Parkinson’s Disease is a debilitating neurological disorder that affects millions of people worldwide. Early detection and treatment can significantly improve the lives of patients with Parkinson’s Disease.

In recent years, machine learning models have emerged as a powerful tool for detecting and predicting Parkinson’s Disease.

This article has outlined the steps involved in developing an ML model for Parkinson’s Disease using Python and a relevant dataset. The article has covered topics such as loading and normalizing the data, splitting the data into training and testing sets, initializing the XGBClassifier, and evaluating the model’s accuracy.

The development of an ML model for Parkinson’s Disease is a valuable step in the detection and management of this debilitating disorder. By implementing the steps outlined in this article, we can create an efficient model that helps detect Parkinson’s Disease earlier, supporting better outcomes for patients.
