Adventures in Machine Learning

Mastering Data Selection in Pandas with loc and iloc Functions

Do you find it challenging to navigate through large and complex datasets in Pandas? Don’t worry; you’re not alone.

Pandas is a powerful library for data analysis; however, it can be intimidating for both beginners and experts. In particular, selecting specific rows and columns from a DataFrame can be a bit confusing.

Fortunately, Pandas provides two essential functions – loc and iloc – that can make data selection much more manageable. In this article, we will explore the difference between the loc and iloc functions and how to use them most effectively.

Overview of loc and iloc functions

The loc and iloc functions are Pandas DataFrame indexers (accessed with square brackets rather than called with parentheses) used to select elements of the DataFrame by index label or by integer position. The loc function is used with index labels, while iloc is used with integer positions.

The loc function takes two selectors inside the square brackets: rows first, then columns. Each selector can be a single label, a list of labels, a slice of labels, or a Boolean array, and the column selector is optional.

You can use the colon (:) to select ranges of rows and columns. The iloc function works the same way as loc, but with integer positions instead of index labels. One difference matters for ranges: a loc slice includes its end label, while an iloc slice excludes its end position.
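As a quick illustration of label-based versus position-based selection, here is a minimal sketch. The `people` DataFrame, its labels, and its values are made up for this example and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical data, used only to contrast labels with positions.
people = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Age": [31, 25, 47]},
    index=["x", "y", "z"],
)

row_by_label = people.loc["y"]          # row whose index label is "y"
row_by_position = people.iloc[1]        # row at integer position 1 (the second row)

cell_by_label = people.loc["y", "Age"]  # row label "y", column label "Age"
cell_by_position = people.iloc[1, 1]    # second row, second column
```

Both pairs of expressions reach the same data; only the way the row and column are addressed differs.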

Use of loc function

The loc function is incredibly useful for row and column selection, as well as filtering data based on criteria. Let’s see some examples:

Creating the DataFrame for loc example

Pandas is all about DataFrames. Before we explore the loc function in detail, we need a DataFrame to work with.

Let’s create one using Pandas’ built-in ‘read_csv’ function.

import pandas as pd

data = pd.read_csv('mydata.csv')

print(data.head())
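The contents of mydata.csv are not shown in this article, so if you want something runnable, one option is to feed read_csv an in-memory stand-in. The rows below are invented; only the column names Name, Age, and Gender are taken from the examples that follow.

```python
import io
import pandas as pd

# Assumed file contents: these rows are invented so the later
# examples have something to run against.
csv_text = """Name,Age,Gender
Adam,27,Male
Bailey,34,Female
Charles,19,Male
David,44,Male
Emily,38,Female
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```

Because no index column is specified, Pandas assigns the default integer index 0 through 4.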

Using loc for row selection

The loc function can select specific rows based on index labels or Boolean arrays. By default, Pandas assigns an integer index to each row, starting from 0, but this can be overridden with custom index labels.

To select a row using loc, we need to provide a single index label that specifies the row we want. Let’s select the first row of our DataFrame, which has an index label of 0:

data.loc[0]
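It is worth seeing what happens when the index is not the default one. The sketch below uses a small invented `scores` DataFrame (a hypothetical name) to show that loc looks up the label 0, not the first position.

```python
import pandas as pd

# Hypothetical frame; names and values are illustrative only.
scores = pd.DataFrame({"Score": [88, 92, 75]})

first_default = scores.loc[0]          # default index: label 0 is the first row

# Replace the default index with custom string labels.
relabeled = scores.copy()
relabeled.index = ["a", "b", "c"]
first_relabeled = relabeled.loc["a"]   # select by the new label instead

try:
    relabeled.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False           # loc raises KeyError for a missing label
```

If you want the first row regardless of how the index is labeled, iloc[0] is the position-based alternative covered later in this article.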

Using loc for row and column selection

Let’s say we want to select specific rows and columns from our DataFrame. We can use the loc function with Boolean arrays or index labels to filter our data.

Here’s an example:

data.loc[(data['Age'] > 30) & (data['Gender'] == 'Male'), ['Name', 'Age']]

In the example above, we used Boolean filters to select only the rows where the Age is greater than 30 and the Gender is male. We also specified two columns – Name and Age – to display by passing a list of column names as the second parameter.
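To make the filter easier to follow, the sketch below splits it into a named mask and the selection itself. The rows are invented for illustration; only the column names come from the example above.

```python
import pandas as pd

# Invented rows; only the column names Name, Age, and Gender come from the text.
data = pd.DataFrame({
    "Name": ["Adam", "Bailey", "David", "Emily"],
    "Age": [27, 34, 44, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# Each comparison yields a Boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than > and ==.
mask = (data["Age"] > 30) & (data["Gender"] == "Male")

# Rows where mask is True, restricted to the Name and Age columns.
older_men = data.loc[mask, ["Name", "Age"]]
print(older_men)
```

Naming the mask is a readability choice; passing the expression directly to loc, as in the original example, behaves the same way.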

Using loc with : for range selection

The colon (:) can be used with the loc function to select a range of rows or columns based on index labels. To select a range of rows, we provide two index labels separated by a colon. Unlike ordinary Python slices, a loc slice includes both endpoints.

Here’s an example that selects the rows from index label 3 through index label 7, inclusive:

data.loc[3:7]

To select a range of columns, we provide two column labels separated by a colon. Here’s an example that selects every column from ‘Name’ through ‘Age’, inclusive (this assumes ‘Name’ appears before ‘Age’ in the column order):

data.loc[:, 'Name':'Age']
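Here is a small runnable sketch of both label ranges. The DataFrame is invented, and the column order Name, Gender, Age is an assumption chosen so that the 'Name':'Age' slice spans more than one column.

```python
import pandas as pd

# Invented data with ten rows and a default integer index (labels 0..9).
data = pd.DataFrame({
    "Name": list("ABCDEFGHIJ"),
    "Gender": ["F", "M"] * 5,
    "Age": list(range(20, 30)),
})

rows = data.loc[3:7]               # labels 3, 4, 5, 6, 7: five rows
cols = data.loc[:, "Name":"Age"]   # every column from Name through Age

print(rows)
print(list(cols.columns))
```

Note that both the row slice and the column slice keep their end label.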

Use of iloc function

The iloc function is incredibly useful when you need to select specific rows and columns from a DataFrame based on their integer positions. Here are some examples:

Using iloc for row selection

We can use the iloc function to select specific rows from the DataFrame based on their integer position. To select the first row of our DataFrame, which has an integer position of 0, we can use the following code:

data.iloc[0]

Using iloc for row and column selection

We can use the iloc function to select specific rows and columns from the DataFrame based on their integer position. Here’s an example:

data.iloc[[0, 2, 4], [1, 3]]

In the example above, we used lists of integer positions to select the first, third, and fifth rows (positions 0, 2, and 4) and the second and fourth columns (positions 1 and 3).
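The following sketch runs the same selection on an invented 5x4 DataFrame whose values encode their own position (row * 10 + column), which makes it easy to see which cells were picked. The column names c0 to c3 are placeholders.

```python
import pandas as pd

# Invented 5x4 frame; each value is row * 10 + column.
data = pd.DataFrame(
    [[r * 10 + c for c in range(4)] for r in range(5)],
    columns=["c0", "c1", "c2", "c3"],
)

# Rows at positions 0, 2, 4 and columns at positions 1, 3.
picked = data.iloc[[0, 2, 4], [1, 3]]
print(picked)
```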

A list of integer positions passed as the first argument selects multiple rows; a list passed as the second argument selects multiple columns.

Using iloc with : for range selection

The iloc function can be used to select a range of rows or columns based on their integer positions.

Here’s an example:

data.iloc[3:7] # selects the rows at positions 3 through 6 (position 7 is excluded)

data.iloc[:, 1:4] # selects the columns at positions 1 through 3 (position 4 is excluded)
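A minimal sketch of how the end point is handled, using an invented DataFrame with placeholder column names col0 to col4. With a default integer index, the same numbers give different results in iloc and loc.

```python
import pandas as pd

# Invented frame: 10 rows, default integer index, placeholder column names.
data = pd.DataFrame({f"col{i}": list(range(10)) for i in range(5)})

rows = data.iloc[3:7]       # positions 3, 4, 5, 6 (four rows)
cols = data.iloc[:, 1:4]    # col1, col2, col3 (three columns)

# loc treats the same numbers as labels and includes the end label.
rows_by_label = data.loc[3:7]   # labels 3 through 7 (five rows)
```

This off-by-one difference is easy to miss when the index happens to be the default 0, 1, 2, and so on.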

Conclusion

Pandas’ loc and iloc functions are powerful tools for selecting elements from DataFrames. The loc function is used with index labels, while iloc is used with integer positions.

Both functions can select specific rows and columns by using Boolean arrays or index labels/integers. Remember to use the colon (:) with both loc and iloc to select ranges of rows and columns, keeping in mind that loc includes the end label while iloc excludes the end position.

With this knowledge, you should be able to navigate and analyze complex datasets more effectively.

Example 2 – How to Use iloc in Pandas

In our previous section, we learned that iloc is used for row and column selection based on integer positions. In this section, we will explore how to use iloc to filter our data based on those specific positions.

We will use a similar DataFrame to Example 1 but with more data.

Creating the DataFrame for iloc example

Let’s create a new DataFrame to work with:

```
import pandas as pd

data = {'Name': ['Adam', 'Bailey', 'Charles', 'David', 'Emily'],
        'Age': [27, 34, 19, 44, 38],
        'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
        'Salary': [50000, 60000, 40000, 80000, 65000],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Seattle', 'Denver']}

df = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])
```

Let’s see some use cases of the iloc function.

Using iloc for row selection

We can select rows from a DataFrame based on their integer position, regardless of the index labels. For example, to select the first row of our DataFrame (label ‘A’, position 0), we can use the following code:

```
df.iloc[0]
```

Output:

```
Name          Adam
Age             27
Gender        Male
Salary       50000
City      New York
Name: A, dtype: object
```

To select multiple rows, we can use iloc with a list of integer positions.

```
df.iloc[[1, 3, 4]]
```

Output:

```
     Name  Age  Gender  Salary         City
B  Bailey   34  Female   60000  Los Angeles
D   David   44    Male   80000      Seattle
E   Emily   38  Female   65000       Denver
```

Using iloc for row and column selection

We can use iloc to select specific rows and columns of a DataFrame. For example, to select the first two rows of our DataFrame, and their first two columns, we can use the following code:

```
df.iloc[0:2, 0:2]
```

Output:

```
     Name  Age
A    Adam   27
B  Bailey   34
```

In the above example, we used integer position slices for both rows (0:2) and columns (0:2). Note that the upper bound of an iloc slice is exclusive, so 0:2 selects positions 0 and 1 only.

We can also use iloc with a list of integer positions to select specific rows and columns:

```
df.iloc[[1, 3], [0, 3]]
```

Output:

```
     Name  Salary
B  Bailey   60000
D   David   80000
```

Using iloc with : for range selection

We can use iloc with the : operator to select a range of rows or columns. For example, to return the first three rows of the DataFrame:

```
df.iloc[:3,:]
```

Output:

```
      Name  Age  Gender  Salary         City
A     Adam   27    Male   50000     New York
B   Bailey   34  Female   60000  Los Angeles
C  Charles   19    Male   40000      Chicago
```

Similarly, we can select a range of columns by specifying the range of column positions.

```
df.iloc[:, 1:4]
```

Output:

```
   Age  Gender  Salary
A   27    Male   50000
B   34  Female   60000
C   19    Male   40000
D   44    Male   80000
E   38  Female   65000
```
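For comparison, the two range selections in this section can also be written with loc and labels. The sketch below rebuilds the same DataFrame and checks that each pair returns identical data. The variable names are hypothetical.

```python
import pandas as pd

# Same DataFrame as in this example section.
data = {'Name': ['Adam', 'Bailey', 'Charles', 'David', 'Emily'],
        'Age': [27, 34, 19, 44, 38],
        'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
        'Salary': [50000, 60000, 40000, 80000, 65000],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Seattle', 'Denver']}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])

# First three rows: positions 0..2, or labels 'A' through 'C' (end included).
first_three_by_position = df.iloc[:3, :]
first_three_by_label = df.loc['A':'C', :]

# Middle columns: positions 1..3, or labels 'Age' through 'Salary'.
middle_by_position = df.iloc[:, 1:4]
middle_by_label = df.loc[:, 'Age':'Salary']

print(first_three_by_position.equals(first_three_by_label))
print(middle_by_position.equals(middle_by_label))
```

Which form to prefer is largely a matter of whether your code knows the labels or the positions; the label form keeps working if columns are reordered, while the position form does not depend on label names.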

Additional Resources

Pandas is a powerful library with a variety of complex functions that you can use to work with data effectively. If you are new to pandas, you might want to consider starting with these common operations and functions:

– Data input/output (reading and writing data from/to various sources such as CSV files, Excel spreadsheets, and databases)

– Data selection (loc, iloc, boolean indexing)

– Data aggregation (groupby, pivot tables)

– Data cleaning and manipulation (merging, concatenating, pivoting, reshaping data)

– Data visualization

If you’re looking to learn more about these operations, many comprehensive tutorials and reference documents are available.

Here are some helpful Pandas resources to get you started:

– The official Pandas documentation: https://pandas.pydata.org/docs/

– Pandas Cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

– Pandas Tutorial: https://www.w3schools.com/python/pandas/default.asp

– Pandas Cookbook: https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html

Don’t forget to practice your new skills with Pandas in Jupyter notebooks. Jupyter notebooks allow you to experiment and iterate quickly, visualize data in real-time, and document your analysis in code.

In this article, we explored the difference between loc and iloc functions in Pandas data selection. While loc is used for indexing by label, iloc is used for indexing by integer position.

We learned how to use both functions to select and filter rows and columns from a DataFrame, and how to use them to select ranges of rows and columns. By understanding loc and iloc and their usage, you can effectively extract meaningful insights from your datasets.

Remember to consult additional resources such as the Pandas documentation, tutorials, and cheat sheet to further improve your Pandas knowledge and skills. With these takeaway points, you can apply Pandas effectively to your data analysis tasks.
