Introduction to NumPy
Data analysis is a crucial aspect of today’s world. However, when it comes to manipulating large datasets, using built-in data types in Python can be challenging and time-consuming.
NumPy solves this problem by providing a fast, efficient, and user-friendly way to apply mathematical operations on arrays and matrices of any size.
Benefits of using NumPy
Speed is one of the primary benefits of using NumPy. Its core is written in C, and its linear algebra routines rely on optimized C and Fortran libraries, so array operations are typically much faster than equivalent loops over Python’s built-in data types like lists. These underlying routines are optimized for mathematical operations on large datasets, which makes NumPy a popular choice for scientific and engineering applications.
Besides speed, NumPy lets you write fewer loops and clearer code. In plain Python, explicit loops are often needed even for simple element-wise operations.
With NumPy, these operations are vectorized: a single expression applies to the whole array, so a large dataset can be processed in fewer lines of code.
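As a rough illustration of the difference, the sketch below converts a few temperatures from Celsius to Fahrenheit twice, once with an explicit Python loop and once with a vectorized NumPy expression. The sample data and variable names are made up for this example.

```python
import numpy as np

# Hypothetical sample data: temperatures in degrees Celsius.
celsius = [12.0, 18.5, 21.0, 25.5, 30.0]

# Plain Python: an explicit loop handles one element at a time.
fahrenheit_loop = []
for c in celsius:
    fahrenheit_loop.append(c * 9 / 5 + 32)

# NumPy: the same arithmetic is written once and applied to the whole array.
celsius_array = np.array(celsius)
fahrenheit_vectorized = celsius_array * 9 / 5 + 32

# Both approaches produce the same numbers.
print(np.allclose(fahrenheit_loop, fahrenheit_vectorized))
```

On five values the speed difference is negligible; the benefit of the vectorized form grows with the size of the array.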
Installing NumPy
There are several ways to get NumPy up and running, including Repl.it, Anaconda, and pip, along with interactive environments such as IPython, notebooks, and JupyterLab. Repl.it is an online code editor that allows you to write and run Python programs without installing any software on your computer.
Anaconda is a Python distribution that comes pre-packaged with over 200 data science libraries, including NumPy, pandas, and Matplotlib. pip is the package manager for Python; running pip install numpy installs NumPy. IPython is an interactive command-line shell for Python that lets you test and execute code snippets.
Notebooks and JupyterLab come pre-installed with Anaconda and allow you to write code in a web browser.
Hello NumPy: Curving Test Grades Tutorial
In this tutorial, we will look at how to use NumPy to curve test grades.
The first step is to import the NumPy library.
import numpy as np
Next, we will create an array of test grades.
grades = np.array([78, 79, 84, 70, 90, 81, 72, 88, 76, 85])
To curve the grades, we will add five points to each grade using broadcasting, a feature that lets NumPy combine operands of different shapes. Here it stretches the scalar 5 across every element of the array.
curved_grades = grades + 5
We can also use built-in NumPy functions like mean and median to get the average and median grades.
mean_grade = np.mean(curved_grades)
median_grade = np.median(curved_grades)
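Broadcasting is not limited to a single scalar. As a sketch, assume a hypothetical table of scores with one row per student and one column per test, and a different curve for each test. The names scores, curve, and capped_scores are illustrative only.

```python
import numpy as np

# Hypothetical scores: 3 students (rows) x 2 tests (columns).
scores = np.array([[78, 84],
                   [70, 90],
                   [88, 76]])

# Assumed curve: +5 points on the first test, +2 on the second.
curve = np.array([5, 2])

# Shapes (3, 2) and (2,) are compatible, so the curve is added to every row.
curved_scores = scores + curve
print(curved_scores)

# np.clip keeps curved scores inside the 0 to 100 range.
capped_scores = np.clip(curved_scores, 0, 100)
```

Capping with np.clip is an optional extra step, useful when a curve could push a grade above the maximum score.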
Getting Into Shape: Array Shapes and Axes
Mastering Shape
Shape is a fundamental NumPy attribute that describes the size of an array along each of its dimensions. It is a tuple with one entry per dimension; for a two-dimensional array, it gives the number of rows and columns.
To print the shape of an array, read its shape attribute. It is an attribute rather than a method, so no parentheses are needed.
grades = np.array([[78, 79, 84], [70, 90, 81], [72, 88, 76], [85, 82, 79]])
print(grades.shape)
The output will be:
(4, 3)
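A few related attributes and the reshape method round out the picture. The sketch below reuses the same grades array; the names flat and regrouped are introduced here for illustration.

```python
import numpy as np

grades = np.array([[78, 79, 84], [70, 90, 81], [72, 88, 76], [85, 82, 79]])

print(grades.ndim)   # number of dimensions (axes): 2
print(grades.size)   # total number of elements: 12

# reshape re-flows the same 12 elements in row-major order.
# It changes the layout only; it does not transpose rows and columns.
flat = grades.reshape(-1)         # -1 lets NumPy infer the length
regrouped = grades.reshape(3, 4)  # 3 rows of 4 elements each
print(flat.shape, regrouped.shape)
```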
Understanding Axes
An axis is a dimension of an array along which an operation can be applied. Axes are numbered from zero: in a two-dimensional array, axis 0 runs down the rows and axis 1 runs across the columns.
For example, to find the maximum grade in each row, we can pass axis=1 to the max function, which collapses the columns and leaves one value per row.
max_grades = np.max(grades, axis=1)
To find the maximum grade in each column, we can pass axis=0 to the max function.
max_grades = np.max(grades, axis=0)
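One way to keep the two axes straight is to look at the shape of the result: the axis you pass is the one that disappears. The sketch below checks this on the same grades array.

```python
import numpy as np

grades = np.array([[78, 79, 84], [70, 90, 81], [72, 88, 76], [85, 82, 79]])

row_max = grades.max(axis=1)   # collapses the 3 columns, one value per row
col_max = grades.max(axis=0)   # collapses the 4 rows, one value per column
overall_max = grades.max()     # no axis: a single value for the whole array

print(row_max.shape)   # (4,)
print(col_max.shape)   # (3,)
```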
Conclusion
NumPy is a powerful tool for manipulating and analyzing large datasets. It provides a faster, more efficient, and user-friendly way to apply mathematical operations on arrays and matrices.
Understanding the concepts of array shapes and axes is essential in mastering NumPy, and this article has provided an introductory guide on how to get started. With NumPy, numerical code becomes shorter, faster, and less error-prone than hand-written loops.
3) Data Science Operations: Filter, Order, Aggregate
In data science, the ability to manipulate and transform datasets is essential. NumPy provides several operations that allow you to filter, order, and aggregate data.
Indexing
Indexing in NumPy is similar to indexing Python lists, but with some additional functionality. You can use square brackets to access elements in a NumPy array.
For example, to access the third element in an array (index 2, because indexing starts at 0), you can use the following code:
grades = np.array([78, 79, 84, 70, 90])
grades[2]
The output will be:
84
You can also use slicing to access a portion of the array.
grades[1:4]
The output will be:
array([79, 84, 70])
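The additional functionality includes negative indices, stepped slices, and one index per axis for multi-dimensional arrays. The sketch below shows a few of these; the table array is a made-up example.

```python
import numpy as np

grades = np.array([78, 79, 84, 70, 90])

last_grade = grades[-1]      # negative indices count from the end
every_other = grades[::2]    # a step of 2 picks indices 0, 2, 4

# Two-dimensional arrays take one index per axis, separated by a comma.
table = np.array([[78, 79, 84],
                  [70, 90, 81]])
cell = table[1, 2]           # row 1, column 2
first_column = table[:, 0]   # every row, column 0

# Unlike list slices, NumPy slices are views onto the same data.
middle = grades[1:4]
middle[0] = 100              # this also changes grades[1]
print(grades)
```

Because slices are views, call .copy() on a slice when you need an independent array.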
Masking and Filtering
Masking and filtering are powerful operations that allow you to extract specific elements from an array based on certain conditions. Masking involves creating a boolean array that specifies which elements of the original array meet a certain condition.
For instance, to identify all grades above 80, you can run the following code:
mask = grades > 80
print(mask)
The output will be:
[False False  True False  True]
The mask variable holds a boolean array that indicates whether each element satisfies the condition. You can use this mask to select only the grades greater than 80.
filtered_grades = grades[mask]
print(filtered_grades)
The output will be:
[84 90]
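Conditions can also be combined. As a sketch, the element-wise operators & (and) and | (or) join two masks, np.where chooses between two values per element, and np.count_nonzero counts the matches. The label strings are arbitrary.

```python
import numpy as np

grades = np.array([78, 79, 84, 70, 90])

# Parentheses are required: & binds more tightly than the comparisons.
in_range = (grades >= 75) & (grades < 85)
print(grades[in_range])

# np.where picks the first value where the condition holds, else the second.
labels = np.where(grades > 80, "above", "at or below")

# Count how many grades satisfy the condition.
count_above = np.count_nonzero(grades > 80)
print(count_above)
```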
Transposing, Sorting, and Concatenating
Transposing is an operation that swaps the rows and columns of an array. This operation is useful when you want to perform operations on columns instead of rows and vice versa.
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
transpose_matrix = np.transpose(matrix)
print(transpose_matrix)
The output will be:
[[1 4 7]
 [2 5 8]
 [3 6 9]]
Sorting is an operation that orders the elements of an array. np.sort returns a sorted copy in ascending order; for descending order, reverse the result, for example with np.sort(grades)[::-1].
grades = np.array([78, 79, 84, 70, 90, 81, 72, 88, 76, 85])
sorted_grades = np.sort(grades)
print(sorted_grades)
The output will be:
[70 72 76 78 79 81 84 85 88 90]
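Two common variations are sorting in descending order and finding where each sorted value came from. The sketch below uses a reversed slice for the first and np.argsort for the second.

```python
import numpy as np

grades = np.array([78, 79, 84, 70, 90, 81, 72, 88, 76, 85])

# np.sort has no descending flag, so reverse the ascending result.
descending = np.sort(grades)[::-1]

# argsort returns the indices that would sort the array.
order = np.argsort(grades)
lowest_index = order[0]     # position of the lowest grade
highest_index = order[-1]   # position of the highest grade
print(descending)
print(lowest_index, highest_index)
```

argsort is useful when a second array, such as student names, has to be reordered in the same way as the grades.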
Concatenating is an operation that allows you to combine multiple arrays into one array.
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
concat_array = np.concatenate((array1, array2))
print(concat_array)
The output will be:
[1 2 3 4 5 6]
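For two-dimensional arrays, the axis argument decides whether arrays are joined as extra rows or extra columns. The following sketch uses small made-up arrays a and b.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6]])

# axis=0 appends rows; the column counts must match.
stacked_rows = np.concatenate((a, b), axis=0)     # shape (3, 2)

# axis=1 appends columns; b.T has shape (2, 1), matching a's 2 rows.
stacked_cols = np.concatenate((a, b.T), axis=1)   # shape (2, 3)

print(stacked_rows.shape, stacked_cols.shape)
```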
Aggregating
Aggregating is an operation that summarizes data by computing a single statistic such as the mean or standard deviation.
grades = np.array([78, 79, 84, 70, 90, 81, 72, 88, 76, 85])
mean_grade = np.mean(grades)
median_grade = np.median(grades)
std_deviation = np.std(grades)
print(mean_grade, median_grade, round(std_deviation, 2))
The output will be:
80.3 80.0 6.25
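Aggregations also accept an axis argument, and np.std takes a ddof argument to switch from the population to the sample standard deviation. The sketch below assumes a hypothetical layout with one row per student and one column per test.

```python
import numpy as np

# Assumed layout: 4 students (rows) x 3 tests (columns).
grades = np.array([[78, 79, 84],
                   [70, 90, 81],
                   [72, 88, 76],
                   [85, 82, 79]])

student_means = grades.mean(axis=1)   # one mean per student
test_means = grades.mean(axis=0)      # one mean per test

# np.std divides by N by default (population standard deviation);
# ddof=1 divides by N - 1 (sample standard deviation).
population_std = np.std(grades)
sample_std = np.std(grades, ddof=1)

print(test_means)
```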
4) Practical Example 1: Implementing a Maclaurin Series
A Maclaurin series is a Taylor series expanded around x = 0. It approximates a function with a polynomial built from the function’s derivatives at zero.
Given a function f(x), its Maclaurin series can be written as:
f(x) = f(0) + f'(0)x + (f''(0)/2!)x^2 + (f'''(0)/3!)x^3 + ...
The Maclaurin series provides another way to calculate the value of a function without directly evaluating the function formula.
To implement a Maclaurin series in NumPy, we can start by defining the function f(x). For this example, let’s use f(x) = sin(x).
import math
def sin_function(x):
    return math.sin(x)
Next, we need the derivatives of f at 0. For sin(x) they are known exactly and repeat in a cycle of four: f(0) = 0, f'(0) = 1, f''(0) = 0, f'''(0) = -1. Numerical differentiation with np.gradient only estimates first derivatives on a grid and loses accuracy quickly at higher orders, so the exact values are the better choice here.
Because every even-order term vanishes, the series reduces to:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
Now we can use NumPy to build the first few nonzero terms and sum them.
import numpy as np
x = 0.5
terms = 5
k = np.arange(terms)  # term index: 0, 1, 2, 3, 4
exponents = 2 * k + 1  # odd powers: 1, 3, 5, 7, 9
factorials = np.array([math.factorial(int(n)) for n in exponents])
signs = (-1.0) ** k  # alternating +1, -1, +1, ...
maclaurin_terms = signs * x ** exponents / factorials
approximation = maclaurin_terms.sum()
# Compare the partial sum with the directly evaluated sine.
print(np.isclose(approximation, sin_function(x)))
The output will be:
True
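To see how the approximation improves as terms are added, the following sketch wraps the series for sin(x) in a function and measures the largest error over a few sample points. The function name maclaurin_sin, the sample range, and the term counts are choices made for this illustration.

```python
import math
import numpy as np

def maclaurin_sin(x, n_terms):
    """Approximate sin(x) with the first n_terms nonzero Maclaurin terms."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for k in range(n_terms):
        n = 2 * k + 1  # only odd powers appear in the series for sine
        total += (-1) ** k * x ** n / math.factorial(n)
    return total

# Sample points between -pi/2 and pi/2 (an arbitrary range for the demo).
xs = np.linspace(-np.pi / 2, np.pi / 2, 5)

# The largest absolute error should shrink as more terms are included.
for n_terms in (1, 3, 5):
    max_error = np.max(np.abs(maclaurin_sin(xs, n_terms) - np.sin(xs)))
    print(n_terms, max_error)
```

The error grows with the distance from 0, so inputs far from the origin need more terms for the same accuracy.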
This example demonstrates how NumPy can be used to implement complex mathematical operations such as the Maclaurin series.
NumPy’s efficient and user-friendly approach to scientific computing makes it a popular choice for data analysis, engineering, and scientific applications.
5) Optimizing Storage: Data Types
In data analysis, optimizing storage is critical to ensure maximum efficiency and faster processing times.
NumPy provides several data types that you can use to optimize storage and maximize computational efficiency.
Numerical Types: int, bool, float, and complex
NumPy provides several numerical data types that allow you to store numerical data efficiently.
The most common numerical data types in NumPy are integer, boolean, float, and complex. Integers are used to store whole numbers.
They are available in several sizes, from 8-bit to 64-bit, in signed and unsigned variants. Boolean data types store true/false values.
Float data types store floating-point numbers and are also available in several sizes, such as 32-bit and 64-bit. Complex data types store complex numbers as a pair of floats.
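The storage effect of choosing a smaller type can be checked directly. The sketch below stores the same four values as 64-bit and as 8-bit integers and compares their memory use; the values are arbitrary.

```python
import numpy as np

values = [1, 2, 3, 4]

as_int64 = np.array(values, dtype=np.int64)
as_int8 = np.array(values, dtype=np.int8)

# itemsize is bytes per element; nbytes is the total for the array.
print(as_int64.itemsize, as_int64.nbytes)   # 8 32
print(as_int8.itemsize, as_int8.nbytes)     # 1 4

# Smaller types hold a smaller range, so check the limits before downsizing.
limits = np.iinfo(np.int8)
print(limits.min, limits.max)               # -128 127
```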
String Types: Sized Unicode
NumPy also provides data types to handle string data. The most common string data type in NumPy is the sized Unicode type. Every element of such an array has the same fixed maximum length, and a longer string assigned to it is silently truncated.
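The fixed width is easy to overlook, so here is a small sketch of what happens when a string is longer than the array allows. The names and the width of 5 are made up for the example.

```python
import numpy as np

# "U5" means Unicode strings of at most 5 characters per element.
names = np.array(["Ann", "Bob"], dtype="U5")

# Assigning a longer string does not raise an error; it is cut to fit.
names[0] = "Alexandra"
print(names[0])   # Alexa

# Without an explicit dtype, NumPy sizes the type to the longest string given.
inferred = np.array(["Ann", "Alexandra"])
print(inferred.dtype)
```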
Structured Arrays
Structured arrays are arrays whose data type is a structured data type. Each element is a record that combines named fields of different data types, including numerical and string types. To define a structured array, describe the fields in a dtype and pass it through the dtype parameter.
dt = np.dtype([('name', np.str_, 16), ('age', np.int8)])
people = np.array([('John Doe', 25), ('Jane Smith', 32), ('Bob Smith', 42)], dtype=dt)
Here, we have defined a structured array with two fields: name, which is a 16-character string, and age, which is an 8-bit integer.
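Once the array exists, each field can be read like a column, and a field can drive a boolean mask. The sketch below repeats the definition above so that it runs on its own.

```python
import numpy as np

dt = np.dtype([('name', np.str_, 16), ('age', np.int8)])
people = np.array([('John Doe', 25), ('Jane Smith', 32), ('Bob Smith', 42)], dtype=dt)

# Indexing by field name returns that field for every record.
print(people['name'])

# A field can be used in a mask to filter whole records.
over_30 = people[people['age'] > 30]
print(over_30['name'])

mean_age = people['age'].mean()
print(mean_age)   # 33.0
```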
More on Data Types
NumPy provides several other data types, including datetime, timedelta, and object types. Datetime types are used to store dates and times, while timedelta types are used to store the difference between two dates or times.
Object types allow you to store any Python object, which makes them flexible but gives up most of NumPy’s speed and storage advantages. NumPy does not turn an ordinary Python class into a typed field, so to build a custom record-like type, nest one structured dtype inside another.
point_dt = np.dtype([('x', np.float64), ('y', np.float64)])
dt = np.dtype([('position', point_dt)])
points = np.array([((0, 0),), ((1, 1),)], dtype=dt)
Here, we have defined a custom data type called point_dt, which has x and y coordinates. We then define a structured array with a single field called position, which is of type point_dt.
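The datetime and timedelta types mentioned earlier in this section can be sketched as follows. The dates are arbitrary examples.

```python
import numpy as np

start = np.datetime64('2024-01-15')
end = np.datetime64('2024-03-01')

# Subtracting two datetime64 values gives a timedelta64.
elapsed = end - start   # 46 days between the two dates
print(elapsed)

# np.arange can generate a run of consecutive dates (the stop date is excluded).
week = np.arange('2024-01-15', '2024-01-22', dtype='datetime64[D]')
print(week.shape)       # (7,)
```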
6) Looking Ahead: More Powerful Libraries
NumPy is just one of several powerful libraries used for data analysis. Below are three more libraries that are commonly used in combination with NumPy.
pandas
pandas is a Python library that provides data structures such as data frames and series. It is built on top of NumPy and provides a more user-friendly interface to work with structured data.
pandas is commonly used for data cleaning, analysis, and manipulation.
scikit-learn
scikit-learn is a Python library used for machine learning tasks such as classification, regression, and clustering. It is built on top of NumPy and provides tools for data preprocessing and feature engineering.
scikit-learn is a popular choice for implementing machine learning algorithms due to its ease of use and scalability.
Matplotlib
Matplotlib is a Python library used for data visualization. It is built on top of NumPy and provides a range of tools for creating highly customizable plots, charts, and graphs.
Matplotlib can be used for a range of visualization tasks, from simple line graphs to highly complex 3D visualizations.
In conclusion, NumPy is a vital library in data analysis and scientific computing that offers several data types and operations to optimize storage and data manipulation.
Additionally, by leveraging other powerful libraries like pandas, scikit-learn, and Matplotlib in conjunction with NumPy, data scientists can achieve more sophisticated data analysis and visualization tasks.
7) Practical Example 2: Manipulating Images With Matplotlib
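Matplotlib is a powerful data visualization library that can also load and display images. It reads an image into a NumPy array, so the manipulation itself is done with NumPy and Matplotlib shows the result. In this example, we will look at how to manipulate images this way.
First, let’s install Matplotlib using pip. The leading exclamation mark runs the command from inside a notebook cell; in a terminal, omit it.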
!pip install matplotlib
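Next, we need an image to work with.
We can use the stinkbug sample image from the Matplotlib documentation for this example. Recent Matplotlib releases no longer accept a URL in imread, so we download the bytes first and pass them in as a file-like object.
import io
import urllib.request
import matplotlib.pyplot as plt
url = 'https://matplotlib.org/stable/_images/stinkbug.png'
with urllib.request.urlopen(url) as response:
    cat_img = plt.imread(io.BytesIO(response.read()), format='png')
plt.imshow(cat_img)
plt.show()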
The above code downloads the image and displays it using Matplotlib’s imshow function. Now, let’s apply some operations on the image.
We can start by flipping the image horizontally using NumPy’s fliplr function.
import numpy as np
flipped_cat = np.fliplr(cat_img)
plt.imshow(flipped_cat)
plt.show()
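The above code flips the image horizontally and displays it using Matplotlib’s imshow function. Another operation you can perform on images is blurring.
To blur an image, we first need to create a kernel. In this example, we will use a 5×5 box kernel, in which every entry is 1/25 so that each output pixel is the average of its 5×5 neighborhood. The convolution function comes from SciPy, which must be installed separately.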
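from scipy.signal import convolve2d
kernel = np.ones((5, 5)) / 25
# convolve2d needs a 2-D array, so average the color channels into grayscale first.
gray_cat = cat_img[..., :3].mean(axis=2) if cat_img.ndim == 3 else cat_img
blurred_cat = convolve2d(gray_cat, kernel, mode='same', boundary='symm')
plt.imshow(blurred_cat, cmap='gray')
plt.show()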
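If SciPy is not available, the same kind of box blur can be sketched with NumPy alone. The version below is an illustration rather than a drop-in replacement: box_blur is a hypothetical helper, it assumes a 2-D grayscale array and an odd kernel size, and it runs on a small synthetic image instead of the downloaded one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_blur(image, size=5):
    """Blur a 2-D array with a size x size mean filter (size must be odd)."""
    pad = size // 2
    # Mirror the edges so the output keeps the same shape as the input.
    padded = np.pad(image, pad, mode='symmetric')
    # View every size x size neighborhood without copying the data.
    windows = sliding_window_view(padded, (size, size))
    # Average each neighborhood to get one output pixel.
    return windows.mean(axis=(-2, -1))

# Synthetic 8x8 grayscale image: one bright pixel on a dark background.
image = np.zeros((8, 8))
image[4, 4] = 1.0

blurred = box_blur(image)
print(blurred.shape)   # (8, 8)
```

The bright pixel is spread evenly over its 5×5 neighborhood, which is the averaging effect the box kernel is meant to produce.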