Dropping Columns in a Pandas DataFrame
Have you ever come across a dataset that contained too many columns that you didn’t need for your analysis? This can be frustrating, right?
Fortunately, Python’s Pandas library makes it easy to drop unwanted column(s) from a DataFrame. Dropping columns from a DataFrame can be done in various ways.
In this article, we will explore three primary methods of dropping columns in a Pandas DataFrame: dropping one column by index, dropping multiple columns by index, and dropping one column by index with duplicates.
Dropping One Column by Index
The simplest way to drop a column in a Pandas DataFrame is to use the drop() method and identify the column by its index number. The syntax for dropping one column in a DataFrame is as follows:
df.drop(df.columns[index_number], axis=1, inplace=True)
Here, df.columns[index_number] looks up the label of the column at position index_number (positions start at 0), and drop() removes the column with that label. The axis=1 argument specifies that we are dropping a column rather than a row; axis=0 would mean we are dropping a row. Finally, the inplace=True argument indicates that we want to modify the DataFrame itself as opposed to returning a modified copy.
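As a minimal sketch of this call, the snippet below uses a small hypothetical DataFrame (the column names a, b, and c are illustrative only) and omits inplace=True, so drop() hands back a new DataFrame and the original stays intact:

```python
import pandas as pd

# Hypothetical three-column frame, used only for illustration
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

index_number = 1  # 0-based position of the column to remove

# Without inplace=True, drop() returns a new DataFrame and leaves df untouched
trimmed = df.drop(df.columns[index_number], axis=1)

print(list(trimmed.columns))  # ['a', 'c']
print(list(df.columns))       # ['a', 'b', 'c']
```

Reassigning the result (df = df.drop(...)) has the same effect as inplace=True and makes it harder to lose columns by accident when a cell or script is run twice.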
Dropping Multiple Columns by Index
If you need to drop multiple columns, you can also use the drop() method. In this case, index df.columns with a list of positions and pass the resulting labels to the drop() method.
The syntax for this is as follows:
df.drop(df.columns[[index_1, index_2, index_3]], axis=1, inplace=True)
This would drop the columns at index positions index_1, index_2, and index_3. You can add or remove positions as necessary.
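The same idea can be sketched with the columns= keyword, which is equivalent to passing axis=1. The DataFrame and the positions list here are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical four-column frame, used only for illustration
df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

positions = [0, 2]              # 0-based positions of the columns to remove
labels = df.columns[positions]  # the labels at those positions: 'a' and 'c'

# columns=labels is equivalent to drop(labels, axis=1)
result = df.drop(columns=labels)

print(list(result.columns))  # ['b', 'd']
```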
Dropping One Column by Index with Duplicates
Sometimes, a DataFrame may contain multiple columns with the same name, which means that specifying the column name cannot single out one of them. The single-column syntax above does not help either: df.columns[index_number] returns the shared name, so drop() removes every column carrying it. To drop only one of the duplicates, select the columns to keep by position with iloc instead.
The code for dropping one column by index with duplicates is as follows:
df = df.iloc[:, [i for i in range(df.shape[1]) if i != index_number]]
This keeps every column except the one at index_number, so the other column with the same name survives.
Examples of Dropping Columns in a Pandas DataFrame
Now that we’ve gained an understanding of how to drop columns in a Pandas DataFrame, let’s look at some examples to make it easier to understand.
Example 1: Drop One Column by Index
Suppose we have the following DataFrame:
import pandas as pd
data = {
    'Name': ['John', 'Mary', 'John', 'Elizabeth', 'David'],
    'Age': [28, 22, 36, 39, 25],
    'Gender': ['M', 'F', 'M', 'F', 'M'],
    'Salary': [20000, 18000, 25000, 18000, 30000]
}
df = pd.DataFrame(data)
Suppose we want to drop the column with index position 2, i.e., the column named Gender. We can use the following code to drop the column:
df.drop(df.columns[2], axis=1, inplace=True)
print(df)
Output:
        Name  Age  Salary
0       John   28   20000
1       Mary   22   18000
2       John   36   25000
3  Elizabeth   39   18000
4      David   25   30000
Example 2: Drop Multiple Columns by Index
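Suppose we want to drop columns with index positions 2 and 3, i.e., the columns named Gender and Salary. Because Example 1 modified df in place, first recreate the original four-column DataFrame; otherwise position 3 no longer exists and the lookup raises an IndexError. We can then use the following code: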
df.drop(df.columns[[2, 3]], axis=1, inplace=True)
print(df)
Output:
        Name  Age
0       John   28
1       Mary   22
2       John   36
3  Elizabeth   39
4      David   25
Example 3: Drop One Column by Index with Duplicates
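Suppose we have the following DataFrame that contains two columns named Age:
data = {
    'Name': ['John', 'Mary', 'John', 'Elizabeth', 'David'],
    'Age': [28, 22, 36, 39, 25],
    'Gender': ['M', 'F', 'M', 'F', 'M'],
    'Salary': [20000, 18000, 25000, 18000, 30000]
}
df = pd.DataFrame(data)
# A dict cannot hold two 'Age' keys (the second would overwrite the first),
# so create the duplicate name by renaming the last column
df.columns = ['Name', 'Age', 'Gender', 'Age']
Suppose we want to drop the second column named Age, at index position 3. Because drop() matches labels, df.drop(df.columns[3], axis=1) would remove both Age columns, so we keep every position except 3 instead. We can use the following code:
# Keep all columns except the one at position 3
df = df.iloc[:, [i for i in range(df.shape[1]) if i != 3]]
print(df)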
Output:
        Name  Age  Gender
0       John   28       M
1       Mary   22       F
2       John   36       M
3  Elizabeth   39       F
4      David   25       M
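One caveat is worth checking for yourself when column names repeat: drop() matches by label, so passing the name found at a position removes every column with that name. The sketch below, which builds a shortened two-row version of the DataFrame from a list of rows (an assumption made for brevity), contrasts that with a purely positional selection using iloc:

```python
import pandas as pd

# Build a frame with two 'Age' columns; a list of rows plus an explicit
# columns list allows duplicate names, unlike a dict literal
rows = [["John", 28, "M", 20000], ["Mary", 22, "F", 18000]]
df = pd.DataFrame(rows, columns=["Name", "Age", "Gender", "Age"])

# Label-based drop: df.columns[3] is 'Age', so BOTH Age columns go
by_label = df.drop(df.columns[3], axis=1)
print(list(by_label.columns))  # ['Name', 'Gender']

# Position-based selection: keep every position except 3
keep = [i for i in range(df.shape[1]) if i != 3]
by_position = df.iloc[:, keep]
print(list(by_position.columns))  # ['Name', 'Age', 'Gender']
```

The positional version leaves the first Age column (28, 22) in place, which matches the output shown in this example.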
Conclusion
In this article, we have seen three ways to remove columns from a Pandas DataFrame by index position: dropping one column, dropping multiple columns, and dropping one column whose name is duplicated. By following these methods, you can easily remove unwanted data from your DataFrame and clean your datasets, making them easier to work with and analyze.
Dropping columns is a routine part of data cleaning. Remember, messy data can lead to inaccurate insights, so it pays to keep your data as clean as possible.