Adventures in Machine Learning

Making Sense of Categorical Data with Pandas Crosstab() Function

Crosstab() Function in Pandas

Pandas is a popular library for data manipulation and analysis. One of the most useful functions in pandas is crosstab(), which allows you to create a cross-tabulation of two or more factors.

A cross-tabulation is a useful way to summarize the distribution of a categorical variable. It shows the frequency or percentage of each variable combination.

Using the crosstab() function is easy. You just need to pass two or more arrays or Series as arguments to the function.

The output is a DataFrame containing the cross-tabulation. Let’s look at an example:

import pandas as pd
df = pd.read_csv('sales.csv')
print(pd.crosstab(df['Region'], df['Product']))

This code reads a CSV file and creates a cross-tabulation of the ‘Region’ and ‘Product’ columns. The output will look something like this:

Product  A  B  C
Region         
East     10  5  2
North    15  3  8
South     6  4  7
West      8  6  1

In this example, the table shows the number of sales for each product in each region. A value of 10 in the first row and first column means that there were 10 sales of Product A in the East region.

Using normalize argument in crosstab()

The crosstab() function also supports a normalize argument that allows you to compute the percentage of each value. The normalize argument can take three values: ‘all’, ‘index’, and ‘columns’.

These values determine the denominator for the percentage calculation.

  • If you set normalize=’all’, the percentages are relative to the total count.
  • In other words, each value is expressed as a percentage of the total number of observations in the table.
  • If you set normalize=’index’, the percentages are relative to the row totals.
  • In other words, each value is expressed as a percentage of the number of observations in the same row.
  • If you set normalize=’columns’, the percentages are relative to the column totals.
  • In other words, each value is expressed as a percentage of the number of observations in the same column.

Example 1: Create Crosstab with Percentages Relative to All Values

import pandas as pd
df = pd.read_csv('sales.csv')
print(pd.crosstab(df['Region'], df['Product'], normalize='all'))
Product         A         B         C
Region                              
East     0.166667  0.083333  0.033333
North    0.250000  0.050000  0.133333
South    0.100000  0.066667  0.116667
West     0.133333  0.100000  0.016667

In this example, each value is expressed as a percentage of the total number of observations in the table. For example, the value 0.166667 in the first row and first column means that 16.67% of all observations are in the East region and correspond to Product A.

Example 2: Create Crosstab with Percentages Relative to Row Totals

import pandas as pd
df = pd.read_csv('sales.csv')
print(pd.crosstab(df['Region'], df['Product'], normalize='index'))
Product         A         B         C
Region                              
East     0.625000  0.312500  0.062500
North    0.535714  0.107143  0.357143
South    0.300000  0.200000  0.500000
West     0.470588  0.352941  0.176471

In this example, each value is expressed as a percentage of the row total. For example, the value 0.625000 in the first row and first column means that 62.50% of the observations in the East region correspond to Product A.

Example 3: Create Crosstab with Percentages Relative to Column Totals

import pandas as pd
df = pd.read_csv('sales.csv')
print(pd.crosstab(df['Region'], df['Product'], normalize='columns'))
Product         A         B         C
Region                              
East     0.250000  0.208333  0.090909
North    0.375000  0.125000  0.363636
South    0.150000  0.166667  0.318182
West     0.200000  0.500000  0.227273

In this example, each value is expressed as a percentage of the column total. For example, the value 0.250000 in the first row and first column means that 25.00% of all observations correspond to Product A in the East region.

DataFrame Creation

Importing Pandas

Before we can create a DataFrame, we need to import the pandas library. You can import pandas using the following code:

import pandas as pd

Creating a DataFrame

Once you have imported pandas, you can create a DataFrame by passing a dictionary of arrays or lists to the pd.DataFrame() function. Each key in the dictionary will become a column in the DataFrame, and each array or list will become the values in the corresponding column.

Here’s an example:

import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'Age': [25, 32, 18, 47],
    'Gender': ['F', 'M', 'M', 'M'],
    'Salary': [55000, 65000, 45000, 80000]
}
df = pd.DataFrame(data)

print(df)

This code creates a DataFrame with four columns: Name, Age, Gender, and Salary. The output will look something like this:

       Name  Age Gender  Salary
0     Alice   25      F   55000
1       Bob   32      M   65000
2   Charlie   18      M   45000
3      Dave   47      M   80000

DataFrame structure

The DataFrame has rows and columns. Each column has a name, and each row has an index that starts from zero and increases by one for each row.

You can access the values in a DataFrame using the column name and index. For example, to get the value in the first row and second column (Bob’s age), you can use the following code:

print(df.loc[1, 'Age'])

This will print 32.

DataFrame visualization

When working with DataFrames, it’s often useful to visualize the data. Pandas provides several functions for this purpose.

For example, you can use the head() function to see the first few rows of a DataFrame:

print(df.head())

This will print the first five rows of the DataFrame:

       Name  Age Gender  Salary
0     Alice   25      F   55000
1       Bob   32      M   65000
2   Charlie   18      M   45000
3      Dave   47      M   80000

You can also use the tail() function to see the last few rows:

print(df.tail(2))

This will print the last two rows of the DataFrame:

      Name  Age Gender  Salary
2  Charlie   18      M   45000
3     Dave   47      M   80000

Conclusion

In conclusion, pandas is a powerful library for data manipulation and analysis. The crosstab() function is a useful tool for creating cross-tabulations of categorical variables.

You can also use pandas to create DataFrames from dictionaries of arrays or lists. With pandas, you can visualize your data and gain insights into your data quickly and easily.

3) Example DataFrame

Before exploring the crosstab() function, let’s first create a sample DataFrame to work with. The following code shows an example DataFrame:

import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
    'Age': [25, 32, 18, 47, 28],
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Germany'],
    'Salary': [55000, 65000, 45000, 80000, 60000]
}
df = pd.DataFrame(data)

print(df)

This code creates a DataFrame with five columns: Name, Age, Gender, Country, and Salary. The output will look something like this:

      Name  Age Gender  Country  Salary
0    Alice   25      F      USA   55000
1      Bob   32      M   Canada   65000
2  Charlie   18      M       UK   45000
3     Dave   47      M      USA   80000
4      Eva   28      F  Germany   60000

4) Example 1: Crosstab with Percentages Relative to All Values

Now, let’s create a crosstab with percentages relative to all values on the above DataFrame. We will use normalize=’all’ as the argument to create the crosstab.

Here’s the code:

import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
    'Age': [25, 32, 18, 47, 28],
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Germany'],
    'Salary': [55000, 65000, 45000, 80000, 60000]
}
df = pd.DataFrame(data)
crosstab = pd.crosstab(df['Gender'], df['Country'], normalize='all')

print(crosstab)

The output of the code will be:

Country   Canada   Germany        UK       USA
Gender                                        
F        0.000000  0.166667  0.000000  0.166667
M        0.166667  0.000000  0.166667  0.500000

This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the total number of observations. We have normalized the crosstab to all values.

Interpretation of output

The output shows that there are two females, and we know their respective countries. Women represent 33.3% of all observations in this table.

In terms of gender distribution of the population, we have 50% males and 33.3% females. There are no observations for gender or country combinations not included in the original DataFrame.

This crosstab is helpful to know the overall distribution of gender and country observations in our DataFrame. By using a crosstab with normalization, we can see a percentage relative to all observations.

5) Example 2: Crosstab with Percentages Relative to Row Totals

Let’s create a crosstab with percentages relative to row totals. We will use normalize=’index’ as the argument to create the crosstab.

Here’s the code:

import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
    'Age': [25, 32, 18, 47, 28],
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Germany'],
    'Salary': [55000, 65000, 45000, 80000, 60000]
}
df = pd.DataFrame(data)
crosstab = pd.crosstab(df['Gender'], df['Country'], normalize='index')

print(crosstab)

The output of the code will be:

Country    Canada   Germany        UK       USA
Gender                                         
F        0.000000  0.500000  0.000000  0.500000
M        0.166667  0.000000  0.166667  0.666667

This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the number of observations in the same row. We have normalized the crosstab to the index (rows).

Interpretation of output

The output shows that among females, 50% are from Germany, and the other 50% are from the USA. Among males, 66.7% are from the USA column.

This is a different way to show percentages than the previous example. Here, we are seeing the percentage of each value relative to row total.

6) Example 3: Crosstab with Percentages Relative to Column Totals

Let’s create a crosstab with percentages relative to column totals. We will use normalize=’columns’ as the argument to create the crosstab.

Here’s the code:

import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
    'Age': [25, 32, 18, 47, 28],
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Germany'],
    'Salary': [55000, 65000, 45000, 80000, 60000]
}
df = pd.DataFrame(data)
crosstab = pd.crosstab(df['Gender'], df['Country'], normalize='columns')

print(crosstab)

The output of the code will be:

Country    Canada  Germany   UK  USA
Gender                               
F             0.0      1.0  0.0  0.5
M             1.0      0.0  1.0  0.5

This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the number of observations in the same column. We have normalized the crosstab to the columns.

Interpretation of output

The output shows that among people from Canada, all belong to the female gender. Among people from Germany, all belong to the male gender.

For the UK column, there is an equal distribution of gender within that country. In the USA column, there are 50% males and 50% females.

Conclusion

In conclusion, the crosstab() function in pandas is a powerful tool for creating cross-tabulations of categorical variables. By using the normalize argument with different values, we can compute the percentage of each value relative to different denominators.

We can visualize crosstabs to gain insights from data. Crosstabs are helpful for:

  • Determining the distribution of categorical variables.
  • Identifying relationships between categorical variables.
  • Gaining insights into the data by calculating percentages.

By using pandas, we can efficiently analyze and visualize our data, leading to data-driven decisions.

Popular Posts