Selecting Multiple Columns in a Pandas DataFrame
Pandas is one of the most popular Python libraries used for data manipulation and analysis. One of the most common tasks when working with data in Python is selecting specific columns from a DataFrame.
In this article, we will explore three different methods for selecting multiple columns in a Pandas DataFrame.
Method 1: Select Columns by Index
The first method involves selecting columns by their index positions.
The DataFrame.iloc property selects rows and columns by integer position, counting from 0. Here’s an example:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 35, 40],
        'gender': ['female', 'male', 'male', 'male'],
        'salary': [50000, 70000, 90000, 110000]}
df = pd.DataFrame(data)
# Select first and third columns
df.iloc[:, [0, 2]]
In the above example, we are using DataFrame.iloc to select all rows and the first and third columns based on their index positions (0 and 2).
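When selecting by position, it helps to confirm which labels sit at those positions. The following is a minimal sketch using the same sample data; the variable names labels_at_positions and first_and_last are chosen here for illustration only.

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 35, 40],
        'gender': ['female', 'male', 'male', 'male'],
        'salary': [50000, 70000, 90000, 110000]}
df = pd.DataFrame(data)

# Look up the column labels stored at positions 0 and 2
labels_at_positions = list(df.columns[[0, 2]])
print(labels_at_positions)  # ['name', 'gender']

# Negative positions count from the end, as with Python lists
first_and_last = df.iloc[:, [0, -1]]
print(list(first_and_last.columns))  # ['name', 'salary']
```

Checking df.columns first is a cheap way to avoid picking the wrong column when the column order changes.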
Method 2: Select Columns in Index Range
The second method involves selecting a range of columns based on their index positions.
Here’s an example:
# Select columns from index 1 up to, but not including, index 3
df.iloc[:, 1:3]
In the above example, we are using DataFrame.iloc to select all rows and the columns at index positions 1 and 2 (‘age’ and ‘gender’). The stop position is exclusive, so the column at index 3 (‘salary’) is not included in the selection.
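Position slices in iloc follow the usual Python slice rules. As a small sketch on a shortened two-row version of the sample data (the variable names are illustrative), the stop position is excluded, an omitted stop runs through the last column, and a step skips columns:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'],
                   'age': [25, 30],
                   'gender': ['female', 'male'],
                   'salary': [50000, 70000]})

# Stop position 3 is excluded, so only positions 1 and 2 are returned
middle = df.iloc[:, 1:3]
print(list(middle.columns))  # ['age', 'gender']

# Omitting the stop position selects through the last column
from_age = df.iloc[:, 1:]
print(list(from_age.columns))  # ['age', 'gender', 'salary']

# A step of 2 takes every other column, starting at position 0
every_other = df.iloc[:, ::2]
print(list(every_other.columns))  # ['name', 'gender']
```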
Method 3: Select Columns by Name
The third method involves selecting columns by their names.
Passing a list of column names inside square brackets selects those columns by name. Here’s an example:
# Select columns by name
df[['name', 'salary']]
In the above example, we are passing a list of names to the DataFrame’s square brackets to select the ‘name’ and ‘salary’ columns.
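Two details of name-based selection are easy to miss: a single label returns a Series while a list of labels returns a DataFrame, and asking for a label that does not exist raises a KeyError. The sketch below illustrates both; the ‘height’ column is a hypothetical name used only to trigger the error.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'],
                   'age': [25, 30],
                   'salary': [50000, 70000]})

one_column = df['name']    # a single label returns a Series
as_frame = df[['name']]    # a list of labels returns a DataFrame, even with one column
print(type(one_column).__name__)  # Series
print(type(as_frame).__name__)    # DataFrame

# 'height' is a hypothetical column that is not in df
missing_raised = False
try:
    df[['name', 'height']]
except KeyError as err:
    missing_raised = True
    print('Missing column:', err)
```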
Example DataFrame Creation and Viewing
Now that we know how to select multiple columns, let’s create a simple DataFrame and view its contents.
Creating a DataFrame
# Create a DataFrame
data = {'name': ['John', 'Jane', 'Joe', 'Jen'],
        'age': [25, 30, 35, 40],
        'gender': ['male', 'female', 'male', 'female'],
        'salary': [60000, 80000, 100000, 120000]}
df = pd.DataFrame(data)
In the above example, we are creating a DataFrame with four columns: ‘name’, ‘age’, ‘gender’, and ‘salary’.
Viewing a DataFrame
# View the entire DataFrame
print(df)
# View the first few rows of the DataFrame
print(df.head())
# View the last few rows of the DataFrame
print(df.tail())
In the above example, we are using the print() function to view the entire DataFrame and the first and last few rows of the DataFrame using the DataFrame.head() and DataFrame.tail() methods, respectively.
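Both head() and tail() return five rows by default and accept an optional row count. Here is a short sketch on the same four-row data; the names top_two and bottom_one are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane', 'Joe', 'Jen'],
                   'age': [25, 30, 35, 40],
                   'gender': ['male', 'female', 'male', 'female'],
                   'salary': [60000, 80000, 100000, 120000]})

# The default is 5 rows; this frame only has 4, so head() returns all of them
print(len(df.head()))  # 4

# Pass a count to control how many rows come back
top_two = df.head(2)
bottom_one = df.tail(1)
print(top_two)
print(bottom_one)

# shape reports (rows, columns) without printing any data
print(df.shape)  # (4, 4)
```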
Conclusion
In this article, we have explored three different methods for selecting multiple columns in a Pandas DataFrame. We have also created a simple DataFrame and viewed its contents using various methods and properties of Pandas.
With these techniques in hand, you should be able to easily manipulate and analyze your data using Python and Pandas. Happy coding!
Selecting Columns by Index Positions
The first technique involves selecting columns by their index positions. Here, the DataFrame.iloc property is used to select specific rows and columns based on their integer positions.
The most basic selection is to pass a list of integer values for the column index. Below is an example:
import pandas as pd
data = {'name': ['Amanda', 'Bobby', 'Caroline', 'David', 'Emma'],
        'age': [25, 39, 35, 45, 29],
        'gender': ['female', 'male', 'female', 'male', 'female'],
        'salary': [40000, 80000, 100000, 90000, 50000]}
df = pd.DataFrame(data)
# Select first and third columns
df.iloc[:, [0, 2]]
The above code selects the columns with index position 0 and 2, which are the ‘name’ and ‘gender’ columns.
Selecting Columns in an Index Range
The second technique involves selecting a range of columns based on their index positions. The syntax is the same as Python’s slice notation (start:stop).
The stop position is exclusive, meaning that the column at the stop position is not included in the selection. Below is an example:
# select columns starting from index 1 and up to 3 (not inclusive of 3)
df.iloc[:, 1:3]
The above code selects the columns with index positions 1 and 2, which are the ‘age’ and ‘gender’ columns.
The third technique involves selecting columns by their names. Below is an example:
# Select columns by name
df[['name', 'gender']]
The above code selects the columns named ‘name’ and ‘gender’. We can also use the loc property to achieve the same result:
# Select columns by name using the loc property
df.loc[:, ['name', 'gender']]
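This code returns the same result as the previous one. The difference is that loc takes a row selector first (here, : for all rows), so it can also filter rows by label or condition in the same call.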
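The relationship between the two forms can be checked directly. The sketch below, using the same sample data, compares the two results, shows that a label slice in loc includes both endpoints (unlike the position slice in iloc), and combines a row filter with a column list. The age threshold of 30 is an arbitrary value picked for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Amanda', 'Bobby', 'Caroline', 'David', 'Emma'],
                   'age': [25, 39, 35, 45, 29],
                   'gender': ['female', 'male', 'female', 'male', 'female'],
                   'salary': [40000, 80000, 100000, 90000, 50000]})

# Square brackets and loc return equal DataFrames for a list of names
by_brackets = df[['name', 'gender']]
by_loc = df.loc[:, ['name', 'gender']]
print(by_brackets.equals(by_loc))  # True

# A label slice in loc includes BOTH endpoints
age_to_salary = df.loc[:, 'age':'salary']
print(list(age_to_salary.columns))  # ['age', 'gender', 'salary']

# loc can filter rows and pick columns in a single call
older = df.loc[df['age'] > 30, ['name', 'gender']]
print(list(older['name']))  # ['Bobby', 'Caroline', 'David']
```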
In conclusion, you can use any of the three methods above in Pandas to select multiple columns based on their index positions or their names. The iloc property selects rows and columns by integer position, while the loc property selects rows and columns by label, such as a column’s name.
For a more straightforward selection, use the square bracket notation.
Below is an example:
import pandas as pd
data = {'name': ['Monica', 'Ross', 'Rachel', 'Joey', 'Chandler'],
        'age': [25, 35, 30, 28, 33],
        'gender': ['female', 'male', 'female', 'male', 'male'],
        'salary': [5000, 10000, 8000, 6000, 7000]}
df = pd.DataFrame(data)
# Select name and gender columns
df[['name', 'gender']]
The above code selects the columns named ‘name’ and ‘gender’ from the DataFrame. We can also assign the selected columns to a new DataFrame variable:
# Assign selected columns to new DataFrame
new_df = df[['name', 'age']]
# View the new DataFrame
print(new_df)
The above code creates a new DataFrame with the selected columns (‘name’ and ‘age’) and assigns it to a new variable ‘new_df’.
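If the new DataFrame will be modified later, adding an explicit .copy() makes it clear that it is independent of the original, and it avoids the SettingWithCopyWarning that some Pandas versions emit when a selection is modified. A minimal sketch, assuming the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Monica', 'Ross', 'Rachel', 'Joey', 'Chandler'],
                   'age': [25, 35, 30, 28, 33],
                   'gender': ['female', 'male', 'female', 'male', 'male'],
                   'salary': [5000, 10000, 8000, 6000, 7000]})

# Take an explicit, independent copy of the selected columns
new_df = df[['name', 'age']].copy()

# Changing the copy leaves the original DataFrame untouched
new_df['age'] = new_df['age'] + 1

print(df['age'].tolist())      # [25, 35, 30, 28, 33]
print(new_df['age'].tolist())  # [26, 36, 31, 29, 34]
```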
Additional Resources: Other Common Operations in Pandas
Now that we have learned the techniques for selecting multiple columns in a Pandas DataFrame, let’s explore a few other common operations:
- Changing the column names: You can change the column names of a DataFrame by using its ‘columns’ property.
This is helpful when you want to replace spaces or special characters in column names. Below is an example:
# Change the column names
df.columns = ['Name', 'Age', 'Gender', 'Salary']
# View the DataFrame with changed column names
print(df)
The above code changes the column names to have uppercase initial letters.
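Assigning to columns replaces every name at once, so the list must match the number of columns. When only some names need to change, or when spaces and special characters need to be cleaned up, rename() and the string methods on df.columns are alternatives. The sketch below uses made-up column names (‘First Name’, ‘Annual Salary ($)’) purely for illustration.

```python
import pandas as pd

# Hypothetical column names containing spaces and special characters
df = pd.DataFrame({'First Name': ['Monica', 'Ross'],
                   'Annual Salary ($)': [5000, 10000]})

# rename() takes an old-to-new mapping and returns a new DataFrame
renamed = df.rename(columns={'First Name': 'first_name'})
print(list(renamed.columns))  # ['first_name', 'Annual Salary ($)']

# Clean every name: lowercase, drop special characters, spaces to underscores
cleaned = df.copy()
cleaned.columns = (cleaned.columns
                   .str.lower()
                   .str.replace(r'[^a-z0-9 ]', '', regex=True)
                   .str.strip()
                   .str.replace(' ', '_', regex=False))
print(list(cleaned.columns))  # ['first_name', 'annual_salary']
```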
- Dropping columns: You can drop one or more columns from a DataFrame using the drop() method.
The drop() method has an ‘axis’ parameter that you can set to 1 to indicate that you want to drop columns. Below is an example:
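# Drop the 'Salary' column, keeping the result in a new variable
df_without_salary = df.drop('Salary', axis=1)
# View the DataFrame without the 'Salary' column
print(df_without_salary)
The above code drops the ‘Salary’ column and returns a new DataFrame without it. The original df is left unchanged, so its ‘Salary’ column is still available for the next example.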
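The drop() method also accepts a columns parameter, which avoids having to remember the axis number, and an errors parameter for labels that may not be present. A small sketch, assuming the renamed columns from above; ‘Bonus’ is a hypothetical label that is not in the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Monica', 'Ross'],
                   'Age': [25, 35],
                   'Gender': ['female', 'male'],
                   'Salary': [5000, 10000]})

# Drop several columns at once with the columns parameter
trimmed = df.drop(columns=['Gender', 'Salary'])
print(list(trimmed.columns))  # ['Name', 'Age']

# drop() returns a new DataFrame; the original keeps all four columns
print(list(df.columns))  # ['Name', 'Age', 'Gender', 'Salary']

# errors='ignore' skips labels that do not exist ('Bonus' is hypothetical)
safe = df.drop(columns=['Salary', 'Bonus'], errors='ignore')
print(list(safe.columns))  # ['Name', 'Age', 'Gender']
```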
- Adding new columns: You can add new columns to a DataFrame using various methods such as indexing, assigning a scalar value, or using the apply() method.
Below is an example:
# Add a new column 'bonus'
df['bonus'] = df['Salary'] * 0.1
# View the updated DataFrame
print(df)
The above code adds a new column ‘bonus’ to the DataFrame by multiplying the existing ‘Salary’ column by 0.1.
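The other approaches mentioned above, assigning a scalar value and using apply(), can be sketched as follows, along with assign(), which returns a new DataFrame instead of modifying the existing one. The ‘currency’ and ‘band’ columns and the 8000 threshold are invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Monica', 'Ross', 'Rachel'],
                   'Salary': [5000, 10000, 8000]})

# A scalar value is repeated for every row
df['currency'] = 'USD'

# assign() returns a new DataFrame with the extra column
df = df.assign(bonus=df['Salary'] * 0.1)

# apply() computes each value with a function of the existing column
df['band'] = df['Salary'].apply(lambda s: 'high' if s >= 8000 else 'low')

print(df)
```

For simple arithmetic, the vectorized form (df['Salary'] * 0.1) is generally preferable to apply(), which calls a Python function once per row.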
In summary, selecting multiple columns in a Pandas DataFrame is a fundamental task in data cleaning and analysis. This article covered three techniques: selecting columns by index position, selecting columns in an index range, and selecting columns by name. It also covered other common operations: changing column names, dropping columns, and adding columns.
For a comprehensive list of functions and tools, consult the official Pandas documentation.