
Mastering Data Cleaning with Pandas and NumPy: Techniques for Reliable Results

Data is ubiquitous in the modern world. Companies use it to make business decisions, scientists use it to make discoveries, and individuals use it for personal reasons.

However, data, in its raw form, can be overwhelming to work with. That’s why data analysts and data scientists use tools like pandas to manipulate and transform data into something more manageable and meaningful.

In this article, we’ll explore two common pandas operations: dropping columns and changing the index of a DataFrame.

Dropping Columns in a DataFrame

A DataFrame is a 2-dimensional labeled data structure, with columns of different types. Sometimes, a DataFrame may have more columns than necessary, which can make it difficult to work with and understand.

For instance, imagine a DataFrame with columns for first name, last name, middle name, and prefix. If we only need to analyze the data by last name, first name, and age, having the extra columns will only add noise to our analysis.

Thus, we need to drop the unnecessary columns. Pandas provides an easy way to drop columns using the drop() function.

The syntax for drop is as follows:

df.drop(['col1', 'col2'], inplace=True, axis=1)

Here, df is the DataFrame we want to modify, ['col1', 'col2'] is a list of columns we want to drop, inplace=True tells pandas to modify the DataFrame in place, and axis=1 tells pandas to drop columns (0 is used for rows). Let’s apply this to our example DataFrame:

import pandas as pd
data = {'first name': ['John', 'Jane', 'Mike'],
        'last name': ['Doe', 'Doe', 'Smith'],
        'middle name': ['Henry', 'Elizabeth', 'Robert'],
        'prefix': ['Mr.', 'Ms.', 'Dr.'],
        'age': [30, 25, 40]}
df = pd.DataFrame(data)
# drop the unnecessary columns
df.drop(['middle name', 'prefix'], inplace=True, axis=1)

The resulting DataFrame will have only the columns we need:

  first name last name  age
0       John       Doe   30
1       Jane       Doe   25
2       Mike     Smith   40
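
As a side note, newer versions of pandas also accept a columns keyword in drop(), which avoids having to remember the axis argument. The following is equivalent to the call above:

# equivalent: name the columns to drop explicitly instead of passing axis=1
df = df.drop(columns=['middle name', 'prefix'])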

Changing the Index of a DataFrame

The index of a DataFrame holds its row labels. By default, pandas assigns an integer index, starting at 0, to each row of a DataFrame.

However, this may not be the most suitable index for our purposes. For example, imagine a DataFrame that contains information on movies, with columns for title, director, year, rating, and genre.

We may want to access movies based on their title, and the integer index is not a good identifier for that. In such cases, we can set the index of the DataFrame to the title column, which is a unique identifier.

Pandas provides the set_index() function to change the index of a DataFrame. The syntax for set_index() is as follows:

df.set_index('col_name', inplace=True)

Here, df is the DataFrame we want to modify, 'col_name' is the name of the column we want to use as the index, and inplace=True tells pandas to modify the DataFrame in place.

Let’s apply this to our movie DataFrame:

data = {'title': ['The Shawshank Redemption', 'The Dark Knight', '12 Angry Men'],
        'director': ['Frank Darabont', 'Christopher Nolan', 'Sidney Lumet'],
        'year': [1994, 2008, 1957],
        'rating': [9.3, 9.0, 8.9],
        'genre': ['Drama', 'Action', 'Drama']}
df = pd.DataFrame(data)
# set the index to the title column
df.set_index('title', inplace=True)

Now we can access movies by title using the loc[] accessor:

# access a movie using its title
df.loc['The Dark Knight']

The output will be:

director    Christopher Nolan
year                     2008
rating                    9.0
genre                  Action
Name: The Dark Knight, dtype: object
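
We can also combine a row label with a column name to pull out a single value:

# look up one field of a movie by row label and column name
df.loc['The Dark Knight', 'director']   # returns 'Christopher Nolan'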

Conclusion

In conclusion, pandas is an essential tool for any data analyst or scientist. In this article, we explored two common pandas operations: dropping columns and changing the index of a DataFrame.

By dropping unnecessary columns, we can make our DataFrames more compact and easier to work with. By changing the index to a unique identifier, we can make it easier to access and analyze specific rows of the DataFrame.

With these two functions and other built-in features of pandas, we can manipulate and transform our data into something more meaningful and insightful.

Tidying up Fields in the Data

Data is not always presented in a user-friendly format. Messy and unstructured data can cause problems and produce unreliable results when fed into analysis models.

Therefore, it is crucial to clean up the data before conducting an analysis. In this article, we’ll cover two common methods of tidying up fields in the data.

First, we’ll explore how to clean up the Date of Publication column, and then we’ll look at how to use str methods with NumPy to clean up the Place of Publication column.

Cleaning the Date of Publication Column

The Date of Publication column is an important aspect of any collection of printed materials. However, it is not always presented in a standardized or easily readable format.

Cleaning up the Date of Publication column can be an essential part of digitizing a library or archive. One way to clean up the Date of Publication column is to use regular expressions (regex) to replace all non-numeric characters with an empty string.

After that, we can use the pd.to_numeric() method to convert the strings to numeric values.

import pandas as pd
import numpy as np
df = pd.read_csv('book_data.csv')
# clean the Date of Publication column
df['Date of Publication'] = df['Date of Publication'].str.replace(r'\D+', '', regex=True)
df['Date of Publication'] = pd.to_numeric(df['Date of Publication'], errors='coerce')

The first line of the code replaces every run of non-digit characters in the Date of Publication column with an empty string (the regex=True flag tells newer versions of pandas to treat the pattern as a regular expression). The second line uses the pd.to_numeric() method to convert the resulting strings to numbers.

Setting the “errors” parameter to “coerce” tells pandas to convert any value it cannot parse into NaN instead of raising an error. Because the column may then contain NaN, the result is stored as floats rather than integers.
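
Since book_data.csv is not shown here, the following self-contained sketch demonstrates the same two lines on a few made-up values (the sample dates are hypothetical):

import pandas as pd
# hypothetical messy dates, as often found in digitized catalogues
df = pd.DataFrame({'Date of Publication': ['1879?', 'ca. 1900', '[1868]', None]})
df['Date of Publication'] = df['Date of Publication'].str.replace(r'\D+', '', regex=True)
df['Date of Publication'] = pd.to_numeric(df['Date of Publication'], errors='coerce')
# result: 1879.0, 1900.0, 1868.0, NaN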

Using np.where() to Clean the Place of Publication Column

The Place of Publication column is another example of messy and unstructured data.

It can contain typographical errors, abbreviations, or incomplete information. One way to tidy the Place of Publication column is to use NumPy’s where function.

Suppose we have a DataFrame containing a column for Place of Publication and we want to replace certain entries with a standardized value. We can use the np.where() function to replace the entries that match our criteria.

df['Place of Publication'] = np.where(df['Place of Publication'].str.contains('London'),
                                      'London',
                                      df['Place of Publication'])

The code replaces any entry in the Place of Publication column that contains “London” with simply “London”. We can write similar code to normalize any other location, or a whole group of locations, as sketched below.
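
For instance, a regex alternation can normalize several variant spellings in one pass (the variants here are hypothetical):

# hypothetical: collapse variant spellings of Oxford into one canonical value
df['Place of Publication'] = np.where(df['Place of Publication'].str.contains('Oxford|Oxon', regex=True),
                                      'Oxford',
                                      df['Place of Publication'])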

Combining str Methods with NumPy to Clean Columns

Cleaning columns using string methods is a common technique. However, combining string methods with NumPy can make the process more efficient.

Suppose we want to clean the Place of Publication column and remove any text that appears after a comma. We can use the str.contains() method to identify the rows that contain a comma and then use np.where() to keep only the text before it.

df['Place of Publication'] = np.where(df['Place of Publication'].str.contains(','), 
                                      df['Place of Publication'].str.split(',').str[0], 
                                      df['Place of Publication'])

The code splits the string on the comma and selects the first part, effectively removing all text that appears after the comma.
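
Strictly speaking, the np.where() guard is optional here: splitting a string that has no comma simply returns the whole string, so the following one-liner (a simplification, not the approach above) produces the same result:

# equivalent: split every entry on the comma and keep the first piece
df['Place of Publication'] = df['Place of Publication'].str.split(',').str[0]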

Conclusion

Cleaning up messy data is an important step when preparing data for analysis. In this article, we covered several techniques to tidy up fields in a DataFrame.

We discussed replacing non-numeric characters in the Date of Publication column with an empty string and using NumPy’s where function to replace certain entries in the Place of Publication column. We also learned how to combine str methods with np.where() to clean up the Place of Publication column by removing text that appears after a comma.

With these techniques, cleaning up messy data becomes more efficient and accurate.

Cleaning the Entire Dataset Using the applymap Function

Data cleaning is a vital part of preparing data sets for analysis. Pandas provides several ways of cleaning data, including the applymap() function.

The applymap() method applies a function to every element in a DataFrame. This technique allows us to perform similar cleaning operations on every part of the dataset.

For instance, suppose we have a DataFrame whose cells store measurements as strings that carry their unit of measure (for example, '10 km'). We can write a function that parses each value and converts it to a single unit, then apply that function to every cell in the DataFrame using applymap(), leaving all values in the same unit of measure.

import pandas as pd
# each cell stores a number together with its unit, e.g. '10 km'
data = {'route_a': ['10 km', '20 km', '30 km'],
        'route_b': ['10000 m', '20000 m', '30000 m']}
df = pd.DataFrame(data)
# conversion factors from each unit to miles
factors = {'km': 0.621371, 'm': 0.000621371, 'miles': 1.0}
def to_miles(value):
    number, unit = value.split()
    return float(number) * factors[unit]
# convert every cell in the DataFrame to miles
df = df.applymap(to_miles)

Here, we create a DataFrame whose cells contain distances in different units of measurement, then define a to_miles() function that splits each string into its numeric part and its unit and multiplies by the matching conversion factor. Finally, we apply the function to every cell in the DataFrame using applymap().
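
One caveat: as of pandas 2.1, applymap() is deprecated in favor of the identically behaved DataFrame.map(), so on newer versions the last line above is better written as:

# pandas >= 2.1: DataFrame.map replaces the deprecated applymap
df = df.map(to_miles)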

Renaming Columns and Skipping Rows

Renaming columns is another aspect of cleaning data that we can perform using pandas. In some cases, column names may not be descriptive enough, or they may contain typographical errors.

We can use the rename() method to modify the column names.

import pandas as pd
data = {'first_name': ['John', 'Jane', 'Mike'],
        'last_name': ['Doe', 'Smith', 'Brown'],
        'age(T)': [30, 25, 40]}
df = pd.DataFrame(data)
# rename the columns
df.rename(columns={'first_name': 'First Name', 'last_name': 'Last Name', 'age(T)': 'Age'}, inplace=True)

Here we define a dictionary whose keys are the current column names and whose values are the new names. We then pass that dictionary to the rename() method’s columns parameter; with inplace=True, the method modifies the column names in place.
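
As a safeguard, recent versions of pandas also accept an errors parameter; setting errors='raise' makes rename() raise a KeyError if a key in the mapping does not match any existing column:

# raises a KeyError if 'Age' were not an existing column name
df.rename(columns={'Age': 'Age (years)'}, errors='raise')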

Skipping rows is another cleaning operation that allows us to remove unwanted data from our DataFrame.

We can skip rows using the skiprows parameter when we read in our data.

import pandas as pd
df = pd.read_csv('example.csv', skiprows=2)

Here, we read in a CSV file called example.csv using Pandas’ read_csv() method. We use the skiprows parameter to specify that we want to ignore the first two rows of the CSV file.
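
The skiprows parameter is flexible: besides an integer count, it also accepts a list of specific row numbers (0-indexed) to skip:

# skip only the first and third rows of the file
df = pd.read_csv('example.csv', skiprows=[0, 2])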

Conclusion

Cleaning data is an essential step in any data analysis project. In this article, we explored cleaning an entire dataset with pandas’ applymap() function and renaming columns using pandas’ rename() method.

applymap() allows us to apply a function to every element of a dataset, making it useful when we want to clean the entire dataset. Renaming columns is important, especially when column names contain typographical errors or follow non-standard naming conventions.

In addition, skipping rows to remove unwanted data from our DataFrame is another vital data cleaning technique. With these techniques, we can prepare our data for analysis and produce reliable and meaningful results.

In conclusion, cleaning up messy and unstructured data is a crucial part of preparing a dataset for analysis. This requires various techniques such as dropping columns, changing the index, tidying up fields using pandas and NumPy, and cleaning the entire dataset.

Using applymap() allows us to clean the entire dataset, while renaming columns and skipping rows help with more targeted cleaning operations. These techniques are essential, as they help produce reliable and meaningful results when analyzing data.

Therefore, data analysts and scientists must prioritize cleaning data to ensure accurate and useful results. Data cleaning deserves this attention because it provides the foundation for robust analysis and prevents incorrect interpretations that lead to ineffective decision-making.
