Crosstab() Function in Pandas
Pandas is a popular library for data manipulation and analysis. One of the most useful functions in pandas is crosstab(), which allows you to create a cross-tabulation of two or more factors.
A cross-tabulation is a useful way to summarize the distribution of a categorical variable. It shows the frequency or percentage of each variable combination.
Using the crosstab() function is easy. You just need to pass two or more arrays or Series as arguments to the function.
The output is a DataFrame containing the cross-tabulation. Let’s look at an example:
import pandas as pd
df = pd.read_csv('sales.csv')
print(pd.crosstab(df['Region'], df['Product']))
This code reads a CSV file and creates a cross-tabulation of the ‘Region’ and ‘Product’ columns. The output will look something like this:
Product A B C
Region
East 10 5 2
North 15 3 8
South 6 4 7
West 8 6 1
In this example, the table shows the number of sales for each product in each region. A value of 10 in the first row and first column means that there were 10 sales of Product A in the East region.
Using normalize argument in crosstab()
The crosstab() function also supports a normalize argument that allows you to compute the percentage of each value. The normalize argument can take three values: ‘all’, ‘index’, and ‘columns’.
These values determine the denominator for the percentage calculation.
- If you set normalize=’all’, the percentages are relative to the total count.
- In other words, each value is expressed as a percentage of the total number of observations in the table.
- If you set normalize=’index’, the percentages are relative to the row totals.
- In other words, each value is expressed as a percentage of the number of observations in the same row.
- If you set normalize=’columns’, the percentages are relative to the column totals.
- In other words, each value is expressed as a percentage of the number of observations in the same column.
Example 1: Create Crosstab with Percentages Relative to All Values
import pandas as pd
df = pd.read_csv('sales.csv')
print(pd.crosstab(df['Region'], df['Product'], normalize='all'))
Product A B C
Region
East 0.166667 0.083333 0.033333
North 0.250000 0.050000 0.133333
South 0.100000 0.066667 0.116667
West 0.133333 0.100000 0.016667
In this example, each value is expressed as a percentage of the total number of observations in the table. For example, the value 0.166667 in the first row and first column means that 16.67% of all observations are in the East region and correspond to Product A.
Example 2: Create Crosstab with Percentages Relative to Row Totals
import pandas as pd
df = pd.read_csv('sales.csv')
print(pd.crosstab(df['Region'], df['Product'], normalize='index'))
Product A B C
Region
East 0.625000 0.312500 0.062500
North 0.535714 0.107143 0.357143
South 0.300000 0.200000 0.500000
West 0.470588 0.352941 0.176471
In this example, each value is expressed as a percentage of the row total. For example, the value 0.625000 in the first row and first column means that 62.50% of the observations in the East region correspond to Product A.
Example 3: Create Crosstab with Percentages Relative to Column Totals
import pandas as pd
df = pd.read_csv('sales.csv')
print(pd.crosstab(df['Region'], df['Product'], normalize='columns'))
Product A B C
Region
East 0.250000 0.208333 0.090909
North 0.375000 0.125000 0.363636
South 0.150000 0.166667 0.318182
West 0.200000 0.500000 0.227273
In this example, each value is expressed as a percentage of the column total. For example, the value 0.250000 in the first row and first column means that 25.00% of all observations correspond to Product A in the East region.
DataFrame Creation
Importing Pandas
Before we can create a DataFrame, we need to import the pandas library. You can import pandas using the following code:
import pandas as pd
Creating a DataFrame
Once you have imported pandas, you can create a DataFrame by passing a dictionary of arrays or lists to the pd.DataFrame() function. Each key in the dictionary will become a column in the DataFrame, and each array or list will become the values in the corresponding column.
Here’s an example:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
'Age': [25, 32, 18, 47],
'Gender': ['F', 'M', 'M', 'M'],
'Salary': [55000, 65000, 45000, 80000]
}
df = pd.DataFrame(data)
print(df)
This code creates a DataFrame with four columns: Name, Age, Gender, and Salary. The output will look something like this:
Name Age Gender Salary
0 Alice 25 F 55000
1 Bob 32 M 65000
2 Charlie 18 M 45000
3 Dave 47 M 80000
DataFrame structure
The DataFrame has rows and columns. Each column has a name, and each row has an index that starts from zero and increases by one for each row.
You can access the values in a DataFrame using the column name and index. For example, to get the value in the first row and second column (Bob’s age), you can use the following code:
print(df.loc[1, 'Age'])
This will print 32.
DataFrame visualization
When working with DataFrames, it’s often useful to visualize the data. Pandas provides several functions for this purpose.
For example, you can use the head() function to see the first few rows of a DataFrame:
print(df.head())
This will print the first five rows of the DataFrame:
Name Age Gender Salary
0 Alice 25 F 55000
1 Bob 32 M 65000
2 Charlie 18 M 45000
3 Dave 47 M 80000
You can also use the tail() function to see the last few rows:
print(df.tail(2))
This will print the last two rows of the DataFrame:
Name Age Gender Salary
2 Charlie 18 M 45000
3 Dave 47 M 80000
Conclusion
In conclusion, pandas is a powerful library for data manipulation and analysis. The crosstab() function is a useful tool for creating cross-tabulations of categorical variables.
You can also use pandas to create DataFrames from dictionaries of arrays or lists. With pandas, you can visualize your data and gain insights into your data quickly and easily.
3) Example DataFrame
Before exploring the crosstab() function, let’s first create a sample DataFrame to work with. The following code shows an example DataFrame:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
'Age': [25, 32, 18, 47, 28],
'Gender': ['F', 'M', 'M', 'M', 'F'],
'Country': ['USA', 'Canada', 'UK', 'USA', 'Germany'],
'Salary': [55000, 65000, 45000, 80000, 60000]
}
df = pd.DataFrame(data)
print(df)
This code creates a DataFrame with five columns: Name, Age, Gender, Country, and Salary. The output will look something like this:
Name Age Gender Country Salary
0 Alice 25 F USA 55000
1 Bob 32 M Canada 65000
2 Charlie 18 M UK 45000
3 Dave 47 M USA 80000
4 Eva 28 F Germany 60000
4) Example 1: Crosstab with Percentages Relative to All Values
Now, let’s create a crosstab with percentages relative to all values on the above DataFrame. We will use normalize=’all’ as the argument to create the crosstab.
Here’s the code:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
'Age': [25, 32, 18, 47, 28],
'Gender': ['F', 'M', 'M', 'M', 'F'],
'Country': ['USA', 'Canada', 'UK', 'USA', 'Germany'],
'Salary': [55000, 65000, 45000, 80000, 60000]
}
df = pd.DataFrame(data)
crosstab = pd.crosstab(df['Gender'], df['Country'], normalize='all')
print(crosstab)
The output of the code will be:
Country Canada Germany UK USA
Gender
F 0.000000 0.166667 0.000000 0.166667
M 0.166667 0.000000 0.166667 0.500000
This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the total number of observations. We have normalized the crosstab to all values.
Interpretation of output
The output shows that there are two females, and we know their respective countries. Women represent 33.3% of all observations in this table.
In terms of gender distribution of the population, we have 50% males and 33.3% females. There are no observations for gender or country combinations not included in the original DataFrame.
This crosstab is helpful to know the overall distribution of gender and country observations in our DataFrame. By using a crosstab with normalization, we can see a percentage relative to all observations.
5) Example 2: Crosstab with Percentages Relative to Row Totals
Let’s create a crosstab with percentages relative to row totals. We will use normalize=’index’ as the argument to create the crosstab.
Here’s the code:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
'Age': [25, 32, 18, 47, 28],
'Gender': ['F', 'M', 'M', 'M', 'F'],
'Country': ['USA', 'Canada', 'UK', 'USA', 'Germany'],
'Salary': [55000, 65000, 45000, 80000, 60000]
}
df = pd.DataFrame(data)
crosstab = pd.crosstab(df['Gender'], df['Country'], normalize='index')
print(crosstab)
The output of the code will be:
Country Canada Germany UK USA
Gender
F 0.000000 0.500000 0.000000 0.500000
M 0.166667 0.000000 0.166667 0.666667
This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the number of observations in the same row. We have normalized the crosstab to the index (rows).
Interpretation of output
The output shows that among females, 50% are from Germany, and the other 50% are from the USA. Among males, 66.7% are from the USA column.
This is a different way to show percentages than the previous example. Here, we are seeing the percentage of each value relative to row total.
6) Example 3: Crosstab with Percentages Relative to Column Totals
Let’s create a crosstab with percentages relative to column totals. We will use normalize=’columns’ as the argument to create the crosstab.
Here’s the code:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
'Age': [25, 32, 18, 47, 28],
'Gender': ['F', 'M', 'M', 'M', 'F'],
'Country': ['USA', 'Canada', 'UK', 'USA', 'Germany'],
'Salary': [55000, 65000, 45000, 80000, 60000]
}
df = pd.DataFrame(data)
crosstab = pd.crosstab(df['Gender'], df['Country'], normalize='columns')
print(crosstab)
The output of the code will be:
Country Canada Germany UK USA
Gender
F 0.0 1.0 0.0 0.5
M 1.0 0.0 1.0 0.5
This crosstab shows us the percentage of each combination of ‘Gender’ and ‘Country’ out of the number of observations in the same column. We have normalized the crosstab to the columns.
Interpretation of output
The output shows that among people from Canada, all belong to the female gender. Among people from Germany, all belong to the male gender.
For the UK column, there is an equal distribution of gender within that country. In the USA column, there are 50% males and 50% females.
Conclusion
In conclusion, the crosstab() function in pandas is a powerful tool for creating cross-tabulations of categorical variables. By using the normalize argument with different values, we can compute the percentage of each value relative to different denominators.
We can visualize crosstabs to gain insights from data. Crosstabs are helpful for:
- Determining the distribution of categorical variables.
- Identifying relationships between categorical variables.
- Gaining insights into the data by calculating percentages.
By using pandas, we can efficiently analyze and visualize our data, leading to data-driven decisions.