Adventures in Machine Learning

Making Sense of Categorical Data with Pandas Crosstab() Function

Crosstab() Function in Pandas

Pandas is a popular library for data manipulation and analysis. One of the most useful functions in pandas is crosstab(), which allows you to create a cross-tabulation of two or more factors.

A cross-tabulation is a useful way to summarize the distribution of a categorical variable. It shows the frequency or percentage of each variable combination.

Using the crosstab() function is easy. You just need to pass two or more arrays or Series as arguments to the function.

The output is a DataFrame containing the cross-tabulation. Let’s look at an example:

“`python

import pandas as pd

df = pd.read_csv(‘sales.csv’)

print(pd.crosstab(df[‘Region’], df[‘Product’]))

“`

This code reads a CSV file and creates a cross-tabulation of the ‘Region’ and ‘Product’ columns. The output will look something like this:

“`

Product A B C

Region

East 10 5 2

North 15 3 8

South 6 4 7

West 8 6 1

“`

In this example, the table shows the number of sales for each product in each region. A value of 10 in the first row and first column means that there were 10 sales of Product A in the East region.

Using normalize argument in crosstab()

The crosstab() function also supports a normalize argument that allows you to compute the percentage of each value. The normalize argument can take three values: ‘all’, ‘index’, and ‘columns’.

These values determine the denominator for the percentage calculation.

– If you set normalize=’all’, the percentages are relative to the total count.

In other words, each value is expressed as a percentage of the total number of observations in the table. – If you set normalize=’index’, the percentages are relative to the row totals.

In other words, each value is expressed as a percentage of the number of observations in the same row. – If you set normalize=’columns’, the percentages are relative to the column totals.

In other words, each value is expressed as a percentage of the number of observations in the same column. Example 1: Create Crosstab with Percentages Relative to All Values

“`python

import pandas as pd

df = pd.read_csv(‘sales.csv’)

print(pd.crosstab(df[‘Region’], df[‘Product’], normalize=’all’))

“`

“`

Product A B C

Region

East 0.166667 0.083333 0.033333

North 0.250000 0.050000 0.133333

South 0.100000 0.066667 0.116667

West 0.133333 0.100000 0.016667

“`

In this example, each value is expressed as a percentage of the total number of observations in the table. For example, the value 0.166667 in the first row and first column means that 16.67% of all observations are in the East region and correspond to Product A.

Example 2: Create Crosstab with Percentages Relative to Row Totals

“`python

import pandas as pd

df = pd.read_csv(‘sales.csv’)

print(pd.crosstab(df[‘Region’], df[‘Product’], normalize=’index’))

“`

“`

Product A B C

Region

East 0.625000 0.312500 0.062500

North 0.535714 0.107143 0.357143

South 0.300000 0.200000 0.500000

West 0.470588 0.352941 0.176471

“`

In this example, each value is expressed as a percentage of the row total. For example, the value 0.625000 in the first row and first column means that 62.50% of the observations in the East region correspond to Product A.

Example 3: Create Crosstab with Percentages Relative to Column Totals

“`python

import pandas as pd

df = pd.read_csv(‘sales.csv’)

print(pd.crosstab(df[‘Region’], df[‘Product’], normalize=’columns’))

“`

“`

Product A B C

Region

East 0.250000 0.208333 0.090909

North 0.375000 0.125000 0.363636

South 0.150000 0.166667 0.318182

West 0.200000 0.500000 0.227273

“`

In this example, each value is expressed as a percentage of the column total. For example, the value 0.250000 in the first row and first column means that 25.00% of all observations correspond to Product A in the East region.

DataFrame Creation

Importing Pandas

Before we can create a DataFrame, we need to import the pandas library. You can import pandas using the following code:

“`python

import pandas as pd

“`

Creating a DataFrame

Once you have imported pandas, you can create a DataFrame by passing a dictionary of arrays or lists to the pd.DataFrame() function. Each key in the dictionary will become a column in the DataFrame, and each array or list will become the values in the corresponding column.

Here’s an example:

“`python

import pandas as pd

data = {

‘Name’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘Dave’],

‘Age’: [25, 32, 18, 47],

‘Gender’: [‘F’, ‘M’, ‘M’, ‘M’],

‘Salary’: [55000, 65000, 45000, 80000]

}

df = pd.DataFrame(data)

print(df)

“`

This code creates a DataFrame with four columns: Name, Age, Gender, and Salary. The output will look something like this:

“`

Name Age Gender Salary

0 Alice 25 F 55000

1 Bob 32 M 65000

2 Charlie 18 M 45000

3 Dave 47 M 80000

“`

DataFrame structure

The DataFrame has rows and columns. Each column has a name, and each row has an index that starts from zero and increases by one for each row.

You can access the values in a DataFrame using the column name and index. For example, to get the value in the first row and second column (Bob’s age), you can use the following code:

“`python

print(df.loc[1, ‘Age’])

“`

This will print 32.

DataFrame visualization

When working with DataFrames, it’s often useful to visualize the data. Pandas provides several functions for this purpose.

For example, you can use the head() function to see the first few rows of a DataFrame:

“`python

print(df.head())

“`

This will print the first five rows of the DataFrame:

“`

Name Age Gender Salary

0 Alice 25 F 55000

1 Bob 32 M 65000

2 Charlie 18 M 45000

3 Dave 47 M 80000

“`

You can also use the tail() function to see the last few rows:

“`python

print(df.tail(2))

“`

This will print the last two rows of the DataFrame:

“`

Name Age Gender Salary

2 Charlie 18 M 45000

3 Dave 47 M 80000

“`

Conclusion

In conclusion, pandas is a powerful library for data manipulation and analysis. The crosstab() function is a useful tool for creating cross-tabulations of categorical variables.

You can also use pandas to create DataFrames from dictionaries of arrays or lists. With pandas, you can visualize your data and gain insights into your data quickly and easily.

3) Example DataFrame

Before exploring the crosstab() function, let’s first create a sample DataFrame to work with. The following code shows an example DataFrame:

“`python

import pandas as pd

data = {

‘Name’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘Dave’, ‘Eva’],

‘Age’: [25, 32, 18, 47, 28],

‘Gender’: [‘F’, ‘M’, ‘M’, ‘M’, ‘F’],

‘Country’: [‘USA’, ‘Canada’, ‘UK’, ‘USA’, ‘Germany’],

‘Salary’: [55000, 65000, 45000, 80000, 60000]

}

df = pd.DataFrame(data)

print(df)

“`

This code creates a DataFrame with five columns: Name, Age, Gender, Country, and Salary. The output will look something like this:

“`

Name Age Gender Country Salary

0 Alice 25 F USA 55000

1 Bob 32 M Canada 65000

2 Charlie 18 M UK 45000

3 Dave 47 M USA 80000

4 Eva 28 F Germany 60000

“`

4) Example 1: Crosstab with Percentages Relative to All Values

Now, let’s create a crosstab with percentages relative to all values on the above DataFrame. We will use normalize=’all’ as the argument to create the crosstab.

Here’s the code:

“`python

import pandas as pd

data = {

‘Name’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘Dave’, ‘Eva’],

‘Age’: [25, 32, 18, 47, 28],

‘Gender’: [‘F’, ‘M’, ‘M’, ‘M’, ‘F’],

‘Country’: [‘USA’, ‘Canada’, ‘UK’, ‘USA’, ‘Germany’],

‘Salary’: [55000, 65000, 45000, 80000, 60000]

}

df = pd.DataFrame(data)

crosstab = pd.crosstab(df[‘Gender’], df[‘Country’], normalize=’all’)

print(crosstab)

“`

The output of the code will be:

“`

Country Canada Germany UK USA

Gender

F 0.000000 0.166667 0.000000 0.166667

M 0.166667 0.000000 0.166667 0.500000

“`

This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the total number of observations. We have normalized the crosstab to all values.

Interpretation of output

The output shows that there are two females, and we know their respective countries. Women represent 33.3% of all observations in this table.

In terms of gender distribution of the population, we have 50% males and 33.3% females. There are no observations for gender or country combinations not included in the original DataFrame.

This crosstab is helpful to know the overall distribution of gender and country observations in our DataFrame. By using a crosstab with normalization, we can see a percentage relative to all observations.

Conclusion

In conclusion, the crosstab() function in pandas is an incredibly useful tool for creating cross-tabulations of categorical variables. By using the normalize argument, we can compute the percentage of each value based on the total count, row totals, or column totals.

We can also gain insights by using these crosstabs in combination with data visualization tools such as histograms, bar charts, or pie charts. With pandas and crosstab(), we have a handy library for creating insightful visualizations and making data-driven decisions.

5) Example 2: Crosstab with Percentages Relative to Row Totals

Let’s create a crosstab with percentages relative to row totals. We will use normalize=’index’ as the argument to create the crosstab.

Here’s the code:

“`python

import pandas as pd

data = {

‘Name’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘Dave’, ‘Eva’],

‘Age’: [25, 32, 18, 47, 28],

‘Gender’: [‘F’, ‘M’, ‘M’, ‘M’, ‘F’],

‘Country’: [‘USA’, ‘Canada’, ‘UK’, ‘USA’, ‘Germany’],

‘Salary’: [55000, 65000, 45000, 80000, 60000]

}

df = pd.DataFrame(data)

crosstab = pd.crosstab(df[‘Gender’], df[‘Country’], normalize=’index’)

print(crosstab)

“`

The output of the code will be:

“`

Country Canada Germany UK USA

Gender

F 0.000000 0.500000 0.000000 0.500000

M 0.166667 0.000000 0.166667 0.666667

“`

This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the number of observations in the same row. We have normalized the crosstab to the index (rows).

Interpretation of output

The output shows that among females, 50% are from Germany, and the other 50% are from the USA. Among males, 66.7% are from the USA column.

This is a different way to show percentages than the previous example. Here, we are seeing the percentage of each value relative to row total.

6) Example 3: Crosstab with Percentages Relative to Column Totals

Let’s create a crosstab with percentages relative to column totals. We will use normalize=’columns’ as the argument to create the crosstab.

Here’s the code:

“`python

import pandas as pd

data = {

‘Name’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘Dave’, ‘Eva’],

‘Age’: [25, 32, 18, 47, 28],

‘Gender’: [‘F’, ‘M’, ‘M’, ‘M’, ‘F’],

‘Country’: [‘USA’, ‘Canada’, ‘UK’, ‘USA’, ‘Germany’],

‘Salary’: [55000, 65000, 45000, 80000, 60000]

}

df = pd.DataFrame(data)

crosstab = pd.crosstab(df[‘Gender’], df[‘Country’], normalize=’columns’)

print(crosstab)

“`

The output of the code will be:

“`

Country Canada Germany UK USA

Gender

F 0.0 1.0 0.0 0.5

M 1.0 0.0 1.0 0.5

“`

This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the number of observations in the same column. We have normalized the crosstab to the columns.

Interpretation of output

The output shows that among people from Canada, all belong to the female gender. Among people from Germany, all belong to the male gender.

For the UK column, there is an equal distribution of gender within that country. In the USA column, there are 50% males and 50% females.

Conclusion

In conclusion, the crosstab() function in pandas is a powerful tool for creating cross-tabulations of categorical variables. By using the normalize argument with different values, we can compute the percentage of each value relative to different denominators.

We can visualize crosstabs to gain insights from data. Crosst