Converting Categorical Variables to Numeric in Pandas: A Comprehensive Guide

Data analysis involves working with both numeric and categorical variables. Numeric variables can be used in calculations directly, but most statistical and machine learning routines expect numeric input, so categorical variables usually need to be encoded as numbers first.

One common preprocessing step is converting categorical variables to numeric variables. In this article, we’ll look at how to convert categorical variables to numeric in Pandas, a popular Python library for data analysis.

Converting One Categorical Variable to Numeric

Let’s begin by looking at how to convert one categorical variable to numeric. Consider the following Pandas DataFrame:

import pandas as pd
teams = ['Team A', 'Team B', 'Team C', 'Team A', 'Team A', 'Team B', 'Team C']
scores = [10, 20, 30, 15, 25, 10, 35]
df = pd.DataFrame({'team': teams, 'score': scores})

print(df)

Output:

     team  score
0  Team A     10
1  Team B     20
2  Team C     30
3  Team A     15
4  Team A     25
5  Team B     10
6  Team C     35

Suppose we want to convert the `team` column to numeric. One way to do this is to use the `pd.factorize()` function.

This function returns two arrays: the first contains the integer code assigned to each row, and the second contains the unique values of the categorical variable. Codes are assigned in order of first appearance, so the first distinct value gets `0`, the next gets `1`, and so on. Here’s how we can use `pd.factorize()` to convert the `team` column to numeric:

df['team_id'], _ = pd.factorize(df['team'])

print(df)

Output:

     team  score  team_id
0  Team A     10        0
1  Team B     20        1
2  Team C     30        2
3  Team A     15        0
4  Team A     25        0
5  Team B     10        1
6  Team C     35        2

Notice that we assign the first array returned by `pd.factorize()`, the integer codes, to a new column called `team_id`, and use the underscore `_` to discard the second array.
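To make the two return values concrete, here is a minimal sketch that keeps both arrays instead of discarding one. The sample Series and the `labels` variable are illustrative assumptions, not part of the example above. It also shows how `pd.factorize()` treats a missing value: it receives the code `-1` and is left out of the unique values.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch), including one missing value
teams = pd.Series(['Team A', 'Team B', None, 'Team A', 'Team C'])

# factorize returns (codes, uniques): integer codes first, unique labels second
codes, uniques = pd.factorize(teams)

# Missing values receive the sentinel code -1 and do not appear in uniques
print(codes.tolist())   # [0, 1, -1, 0, 2]
print(list(uniques))    # ['Team A', 'Team B', 'Team C']

# The uniques array lets us map the codes back to the original labels
labels = [uniques[c] if c >= 0 else None for c in codes]
print(labels)           # ['Team A', 'Team B', None, 'Team A', 'Team C']
```

Keeping `uniques` around is useful whenever the numeric codes need to be translated back into readable category names later.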

Converting Multiple Categorical Variables to Numeric

What if we want to convert multiple categorical variables to numeric? We can apply the same technique to each variable, but this can get tedious if we have many variables.

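Pandas also provides a function that encodes several categorical variables in one call: `pd.get_dummies()`. Note that it produces a different encoding from `pd.factorize()`. Instead of one column of integer codes, it creates a separate 0/1 indicator column for each category (often called one-hot encoding). Suppose we have a new DataFrame with two categorical variables, `color` and `fruit`: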

colors = ['red', 'green', 'blue', 'green', 'red', 'red', 'blue']
fruits = ['apple', 'banana', 'banana', 'apple', 'cherry', 'apple', 'cherry']
prices = [1.00, 0.50, 0.75, 1.20, 0.80, 1.10, 0.90]
df2 = pd.DataFrame({'color': colors, 'fruit': fruits, 'price': prices})

print(df2)

Output:

   color   fruit  price
0    red   apple   1.00
1  green  banana   0.50
2   blue  banana   0.75
3  green   apple   1.20
4    red  cherry   0.80
5    red   apple   1.10
6   blue  cherry   0.90

To convert both the `color` and `fruit` columns to numeric, we can use the `pd.get_dummies()` function:

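# dtype=int yields 0/1 columns; pandas 2.0+ returns True/False by default
df_dummies = pd.get_dummies(df2[['color', 'fruit']], dtype=int)
# Re-attach the numeric price column alongside the indicator columns
df3 = pd.concat([df2[['price']], df_dummies], axis=1)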

print(df3)

Output:

   price  color_blue  color_green  color_red  fruit_apple  fruit_banana  fruit_cherry
0   1.00           0            0          1            1             0             0
1   0.50           0            1          0            0             1             0
2   0.75           1            0          0            0             1             0
3   1.20           0            1          0            1             0             0
4   0.80           0            0          1            0             0             1
5   1.10           0            0          1            1             0             0
6   0.90           1            0          0            0             0             1

Notice that `pd.get_dummies()` created new columns for each unique value in the categorical variables and filled these columns with either `0` or `1` depending on whether the value was present in each observation.
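As a variation, `pd.get_dummies()` can take the whole DataFrame together with a `columns` argument, which avoids the separate `pd.concat()` step. The sketch below uses a shortened copy of the data, and the variable name `encoded` is an illustrative assumption. It also passes `drop_first=True`, which omits the first category of each variable. That is a common choice when the indicator columns feed a linear model, though whether it suits your analysis depends on the method you use.

```python
import pandas as pd

# Shortened copy of the color/fruit data, for illustration only
df2 = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green'],
    'fruit': ['apple', 'banana', 'banana', 'apple'],
    'price': [1.00, 0.50, 0.75, 1.20],
})

# Encode only the listed columns; untouched columns (price) are kept as-is.
# drop_first=True drops the first category of each variable (blue, apple).
# dtype=int requests 0/1 integers rather than booleans.
encoded = pd.get_dummies(df2, columns=['color', 'fruit'],
                         drop_first=True, dtype=int)

print(encoded.columns.tolist())
# ['price', 'color_green', 'color_red', 'fruit_banana']
```

A row with `0` in both `color_green` and `color_red` is then understood to be `blue`, the dropped category.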

Convert all Categorical Columns in a DataFrame to Numeric

One way to convert all categorical columns in a DataFrame to numeric is to call `pd.factorize` on each column by hand, as we did in the first example. However, this quickly becomes repetitive if the DataFrame contains many categorical columns.

A less repetitive approach is to loop over the columns and apply `pd.factorize` to every categorical one. Consider the following DataFrame, which contains three categorical columns and one numeric column:

import pandas as pd
df = pd.DataFrame({
    'Category1': ['A', 'B', 'B', 'A', 'C'],
    'Category2': ['X', 'Y', 'Y', 'Z', 'Z'],
    'Category3': ['M', 'N', 'M', 'N', 'N'],
    'Value': [10, 20, 30, 40, 50]
})

print(df)

Output:

  Category1 Category2 Category3  Value
0         A         X         M     10
1         B         Y         N     20
2         B         Y         M     30
3         A         Z         N     40
4         C         Z         N     50

To convert all categorical columns to numeric, we can loop through each column and apply the `pd.factorize` function:

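for col in df.columns:
    # String columns typically have the 'object' dtype
    if df[col].dtype == 'object':
        # Keep only the integer codes; index [1] would be the unique values
        df[col] = pd.factorize(df[col])[0]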

print(df)

Output:

   Category1  Category2  Category3  Value
0          0          0          0     10
1          1          1          1     20
2          1          1          0     30
3          0          2          1     40
4          2          2          1     50

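Notice that we check the `dtype` of each column, and if it is `object` (the dtype Pandas typically uses for columns of strings), we replace the column with the codes from `pd.factorize`. The numeric `Value` column is left untouched. Because each column is factorized independently, the same code in different columns does not refer to the same category.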
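The same idea can be written without an explicit loop. The following is a minimal sketch, assuming every non-numeric column should be treated as categorical; the variable name `cat_cols` is illustrative. It selects columns with `select_dtypes()` and factorizes each one with `apply()`.

```python
import pandas as pd

df = pd.DataFrame({
    'Category1': ['A', 'B', 'B', 'A', 'C'],
    'Category2': ['X', 'Y', 'Y', 'Z', 'Z'],
    'Value': [10, 20, 30, 40, 50],
})

# Assumption: any column that is not numeric is categorical
cat_cols = df.select_dtypes(exclude='number').columns

# Factorize each selected column and keep only the integer codes
df[cat_cols] = df[cat_cols].apply(lambda s: pd.factorize(s)[0])

print(df)
```

Selecting by "not numeric" rather than by a specific dtype name is a design choice for this sketch. It would also pick up boolean or datetime columns, so adjust the selection if your DataFrame contains those.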

Additional Resources

Pandas is a powerful library that provides many functions for manipulating and analyzing data. If you want to improve your proficiency with Pandas, there are many tutorials available online that can help.

Here are a few of our favorites:

  • “10 minutes to Pandas” (https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html): This guide provides a quick introduction to Pandas, and covers the most commonly used functions and operations.
  • “Data Wrangling with Pandas” (https://www.datacamp.com/courses/data-wrangling-with-pandas): This online course covers advanced Pandas techniques, including grouping, pivoting, merging, and filtering.
  • “Pandas Cheat Sheet” (https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf): This cheat sheet provides a quick reference to the most commonly used Pandas functions and operations, and is a handy resource to have on hand.

Conclusion

In this article, we explored two ways to convert categorical variables to numeric in Pandas. `pd.factorize` replaces each category with an integer code and works on one column at a time; a short loop extends it to every categorical column in a DataFrame. `pd.get_dummies` creates a separate 0/1 indicator column for each category and can encode several columns in a single call.

We also listed a few resources for readers who want to build their proficiency with Pandas. Converting categorical variables to numeric is a routine preprocessing step, and using these functions correctly helps produce accurate results from datasets that contain categorical data.
