Adventures in Machine Learning

Mastering Random Sampling in Python: A Comprehensive Guide

Random Sampling in Python

Python is a flexible programming language used across a wide range of applications. A common task in data science is drawing a random sample from a larger collection of data.

Python’s built-in random module provides the random.sample() function for this purpose. In this article, we’ll explore how to use this function to perform random sampling from lists, sets, dictionaries, and arrays in Python.

1) Random Sampling: Definition and Importance

Random sampling is a statistical technique that involves selecting a subset of data from a larger population.

This selection is done in such a way that each element in the population has an equal chance of being selected. Random sampling is commonly used in research, data science, and many other fields.

2) The random.sample() Function

Python provides several functions for performing random sampling, and one of the most commonly used ones is random.sample().

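The random.sample() function returns a random sample of a given size from a sequence such as a list, tuple, or range, without any repetition. Sets must be converted to a sequence first, because passing a set directly raises a TypeError in Python 3.11 and later.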

This function is built into Python’s random module, which provides several other functions for generating random numbers and data.

How to use the random.sample() function

The syntax for the random.sample() function is straightforward.

To use this function, we need to provide the sequence that we want to sample, along with the number of items we wish to sample.

The general syntax is as follows:

random.sample(population, k)

Where population is the sequence (for example a list, tuple, or range) from which we want to sample, and k is the number of items we wish to sample.

Example: random.sample() function to select multiple items from a list without repetition

Let’s say we have a list of items that we want to sample randomly. For example, to create a list of fruits, we could write:

fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig', 'grape']

Now, suppose we want to randomly select three fruits from the list without repeats.

We can use the random.sample() function as follows:

import random
selected_fruits = random.sample(fruits, 3)

This code will choose three fruits from the list at random, with no repeats, and store them in the variable selected_fruits.
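A self-contained version of this example might look like the following sketch. The final check is only an illustration and is not part of the original example.

```python
import random

fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig', 'grape']

# Draw three distinct fruits; the original list is left unchanged.
selected_fruits = random.sample(fruits, 3)
print(selected_fruits)

# No fruit appears twice in the sample.
print(len(set(selected_fruits)) == 3)  # True
```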

Points to remember about random.sample()

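It’s important to remember that random.sample() requires a sequence, that is, a collection that supports integer indexing and has a length, such as a list, tuple, string, or range. Sets and dictionaries are not sequences and must be converted first, for example with list() or sorted().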

Also, the size of the random sample cannot be greater than the size of the collection being sampled. If k is greater than the length of the population, the function will raise a ValueError.
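The size limit can be seen in a short sketch. The helper safe_sample() is a hypothetical name introduced here for illustration; it simply caps k at the population size, which is one possible way to avoid the error.

```python
import random

fruits = ['apple', 'banana', 'cherry']

def safe_sample(population, k):
    """Hypothetical helper: sample at most len(population) items."""
    return random.sample(population, min(k, len(population)))

try:
    random.sample(fruits, 5)  # k exceeds the population size
except ValueError as exc:
    print(f"ValueError: {exc}")

# Returns all three fruits in random order instead of raising.
print(safe_sample(fruits, 5))
```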

Random sampling with replacement to include repetitions

By default, the random.sample() function returns a list of unique picks with no repeats. But what if we want to generate a list of random integers in which the same number can appear more than once?

To include repetitions in our sampled list, we can use the random.choices() function instead. The random.choices() function takes similar parameters to random.sample(), although the sample size k must be passed as a keyword argument.

The difference is that random.choices() allows repetitions while random.sample() does not. For instance, to generate a list of random integers with repetitions, we can write:

random.choices(range(10), k=5)
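As a brief sketch of how this differs from random.sample(), the example below draws with replacement, uses a k larger than the population, and applies the optional weights parameter. The coin and color values are made up for illustration.

```python
import random

# Five draws from 0..9; the same number may appear more than once.
with_repeats = random.choices(range(10), k=5)
print(with_repeats)

# Unlike random.sample(), k may exceed the population size.
coin_flips = random.choices(['heads', 'tails'], k=10)
print(coin_flips)

# Optional weights bias the draw: 'red' is three times as likely as 'blue'.
weighted = random.choices(['red', 'blue'], weights=[3, 1], k=8)
print(weighted)
```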

Shuffle a list with random.shuffle()

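Another useful function for dealing with random data in Python is random.shuffle(). This function randomly reorders the elements of a mutable sequence such as a list. It cannot be applied to a set, because a set has no order to rearrange.

This function operates in-place: it modifies the list it is given and returns None rather than a new list.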

Suppose we have a list of integers from 1 to 10, and we want to shuffle them randomly.

We can use the random.shuffle() function as follows:

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random.shuffle(numbers)
print(numbers)

This code will shuffle the elements of the list numbers randomly. The same list object now holds the same elements in a different order; no new list is created.
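Because random.shuffle() changes the list it receives, the original order is lost. One way to get a shuffled copy while keeping the original, shown here as a sketch, is to sample the entire list with random.sample().

```python
import random

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# random.shuffle() reorders the list in place and returns None.
result = random.shuffle(numbers)
print(result)   # None
print(numbers)

# To keep the original order, sample the whole list into a new one.
original = [1, 2, 3, 4, 5]
shuffled_copy = random.sample(original, len(original))
print(original)  # [1, 2, 3, 4, 5]
print(shuffled_copy)
```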

Random Sampling from a Python set

Sets are another data structure in Python that can be sampled, with one extra step. A set is an unordered collection of unique elements with no duplicates.

Because a set has no positions to index, random.sample() cannot draw from it directly: passing a set was deprecated in Python 3.9 and raises a TypeError from Python 3.11 onward. The set must first be converted to a sequence such as a list. Note that a set’s lack of order is not the same as randomness; its iteration order is arbitrary, not random.

To use a set in Python, we can write a set literal in curly braces or call the set() function. For example, we can create a set of colors like this:

colors = set(['blue', 'green', 'red', 'yellow', 'orange', 'purple'])

Example: random.sample() function to select random items from a set

To select a random item from a set in Python, we can use the random.sample() function after converting the set to a list.

For example, we can use the following code to select a random item from the colors set:

random_color = random.sample(list(colors), 1)

This code will select a random color from the set colors and store it in the variable random_color as a one-element list. Use random_color[0] to get the color string itself.
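On Python 3.11 and later, random.sample() rejects a set with a TypeError, so the set has to be turned into a sequence first. The sketch below uses sorted() for the conversion, which is one reasonable choice because it gives the list a stable order.

```python
import random

colors = {'blue', 'green', 'red', 'yellow', 'orange', 'purple'}

# sorted() turns the set into a list with a stable order, which also
# helps keep seeded runs reproducible.
color_list = sorted(colors)

random_color = random.sample(color_list, 1)[0]   # a single string
two_colors = random.sample(color_list, 2)        # a list of two distinct strings
print(random_color, two_colors)
```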

Conclusion

In conclusion, random sampling is an essential statistical technique used in a wide range of applications, including data science and research. Python provides several functions for performing random sampling, such as random.sample(), random.choices(), and random.shuffle().

By using a combination of these functions, Python programmers can generate random data for their applications with ease.

3) Random Sampling from a Python dictionary

Python dictionaries are another data structure that can be used in random sampling. Dictionaries are collections of key-value pairs that let us store values and retrieve them by key. Since Python 3.7 they also preserve insertion order.

Dictionaries make it easy to associate keys with values, which makes them well suited to representing mappings.

To use a dictionary in Python, we can write a literal in curly braces {} or call the dict() function. For example, we can create a dictionary of people with their ages like this:

people = {'Jon': 28, 'Ada': 35, 'Lila': 24, 'Ben': 45, 'Kayla': 30}

Example: random.sample() function to select random key-value pairs from a dictionary

To select a random key-value pair from a dictionary in Python, we can use the random.sample() function after converting the dictionary’s items to a list.

For example, we can use the following code to select a random person with their age from the people dictionary:

random_person = random.sample(list(people.items()), 1)

The items() method of a dictionary returns a view of its key-value pairs as tuples. A view is not a sequence, so we wrap it in list() before passing it to random.sample(). The resulting random_person variable is a list containing one tuple, with a person’s name as the first element and their age as the second.
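A slightly fuller sketch of dictionary sampling follows. It samples two pairs, rebuilds a smaller dictionary from them, and shows the alternative of sampling only the keys. The variable names subset, names, and ages are introduced here for illustration.

```python
import random

people = {'Jon': 28, 'Ada': 35, 'Lila': 24, 'Ben': 45, 'Kayla': 30}

# Sample two (name, age) pairs; list() makes the items view a sequence.
pairs = random.sample(list(people.items()), 2)
print(pairs)

# Rebuild a smaller dictionary from the sampled pairs.
subset = dict(pairs)

# Alternatively, sample only the keys and look the values up afterwards.
names = random.sample(list(people), 2)
ages = [people[name] for name in names]
print(subset, names, ages)
```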

4) Random Seed to get the same sample list every time

Random sampling is inherently unpredictable as it selects elements from a collection at random. In some instances, we may want the same sample to be generated each time we run our program.

In such a situation, we can use a random seed. By using a random seed, we can set the starting point for the random number generator, ensuring that the same random numbers will be generated each time the seed is set.

To set a random seed, we can use the random.seed() function.

The seed value can be any integer, and it will set the initial state of the random number generator.

Example: random.seed() function to get the same sampled list every time

Consider the following example, where we have a list of numbers, and we want to select a random sample of five numbers from the list each time we run our program:

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

random.seed(0)
random_numbers = random.sample(numbers, 5)
print(random_numbers)

random.seed(0)
random_numbers = random.sample(numbers, 5)
print(random_numbers)

In this code, we call random.seed(0) to initialize the random number generator from the seed value 0.

We then use the random.sample() function to select five numbers from the list. Because the generator is reseeded with the same value before each call, both print statements output the same list, and running the program again on the same Python version reproduces it.

This technique is useful when we need our code to produce the same results each time it runs. A typical case is testing, where a fixed seed lets a test case run against identical data every time.
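random.seed() resets the module-wide generator that all code in the program shares. Where that is undesirable, a separate random.Random instance is one alternative. The sketch below assumes an arbitrary seed of 42.

```python
import random

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Two independent generators seeded with the same (arbitrary) value.
rng_a = random.Random(42)
rng_b = random.Random(42)

sample_a = rng_a.sample(numbers, 5)
sample_b = rng_b.sample(numbers, 5)

print(sample_a == sample_b)  # True: same seed, same sequence of draws
```

Neither generator affects the other or the global random functions, so seeded tests do not interfere with the rest of the program.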

Conclusion

In this article, we have seen how to use the random.sample() function in Python to select random elements from a list, set, and dictionary. We have also learned how to use the random.seed() function to obtain the same set of random numbers every time the code is executed.

Random sampling is useful in various situations, and it is an essential tool in data science and research. Python’s built-in random module is a convenient resource for generating random numbers and data, simplifying the process of conducting statistical analysis and testing.

5) Getting a sample array from a multidimensional array

In Python, multi-dimensional arrays are often used to represent complex data structures. Multidimensional arrays can be thought of as tables of elements with rows and columns, and they can be created from nested lists or the NumPy library.

NumPy is a popular Python library for scientific computing and provides a wide range of functions for dealing with multi-dimensional arrays.

To create a multidimensional array in Python, we can use a nested list. A two-dimensional array is a list of lists and can be created as follows:

arr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

This creates a two-dimensional array with three rows and three columns.

We can then access elements of the array using two indices, one for the row and one for the column.

Example: numpy.random.choice() function to pick multiple random rows from the multidimensional array

To select a random sample of rows from a two-dimensional array, we can use the numpy.random.choice() function from the NumPy library.

This function draws a sample of a given size from a one-dimensional array, or from the integers 0 to n-1 when given an integer n. Passing a two-dimensional array directly raises a ValueError, so we sample row indices instead and use them to index the array. For example, we can use the following code to select two random rows from the array arr:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_indices = np.random.choice(arr.shape[0], size=2, replace=False)
random_rows = arr[row_indices]

In this code, we first import the NumPy library and create the multidimensional array arr using the numpy.array() function.

We then use the np.random.choice() function to pick two distinct row indices, with the replace parameter set to False so that no row is chosen twice. Indexing arr with those indices gives random_rows, a 2×3 array holding two rows of the original array arr.
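The legacy np.random.choice() function expects one-dimensional input, so rows are usually sampled through their indices. The sketch below shows that approach with NumPy’s newer Generator interface (np.random.default_rng), along with Generator.choice’s axis argument, which can select rows directly. The seed value is an arbitrary assumption.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice

# Option 1: sample row indices, then index the array with them.
row_indices = rng.choice(arr.shape[0], size=2, replace=False)
rows_by_index = arr[row_indices]

# Option 2: let Generator.choice pick along axis 0 directly.
rows_direct = rng.choice(arr, size=2, replace=False, axis=0)

print(rows_by_index.shape, rows_direct.shape)  # (2, 3) (2, 3)
```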

This method applies to higher dimensional arrays as well. For example, to sample two sub-arrays of shape (2, 3) from a multidimensional array with shape (3, 2, 3), we can write:

arr = np.random.rand(3, 2, 3)
indices = np.random.choice(arr.shape[0], size=2, replace=False)
arr_sampled = arr[indices]

This produces an array of shape (2, 2, 3) containing two randomly chosen 2×3 sub-arrays from the original three-dimensional array.

6) Error and exception

Despite its simplicity, the random.sample() function raises exceptions when it is not used correctly.

Possible errors while using the random.sample() function

One possible error is the ValueError.

This error occurs when the sample size k is greater than the size of the population, because a sample without repetition cannot contain more items than the population does. A k equal to the population size is allowed and returns every element in random order. A negative k also raises a ValueError.

Another possible error is the TypeError. This error occurs if the population is not a sequence, that is, an object that supports integer indexing and has a length. Dictionaries always trigger it, and so do sets in Python 3.11 and later; converting them with list() or sorted() avoids the problem.

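A third error that can appear when sampling from dictionaries is the KeyError. random.sample() does not look up keys itself, and passing a dictionary to it directly raises a TypeError in Python 3. A KeyError arises only when later code looks up a key that the dictionary does not contain.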
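The sketch below exercises these cases. The helper try_sample() is a hypothetical name used only for this illustration; it returns either the sample or the name of the exception that was raised.

```python
import random

people = {'Jon': 28, 'Ada': 35}

def try_sample(population, k):
    """Hypothetical helper: return the sample, or the name of the exception raised."""
    try:
        return random.sample(population, k)
    except (ValueError, TypeError) as exc:
        return type(exc).__name__

print(try_sample([1, 2, 3], 3))     # allowed: k equal to the population size
print(try_sample([1, 2, 3], 4))     # ValueError
print(try_sample([1, 2, 3], -1))    # ValueError
print(try_sample(people, 1))        # TypeError: a dict is not a sequence
print(try_sample(list(people), 1))  # works once converted to a list
```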

In Summary

In this article, we have discussed how to get a sample array from a multidimensional array in Python. Multidimensional arrays play a significant role in data science and scientific computing, and Python libraries like NumPy provide a wide range of functions for dealing with arrays.

We have also discussed the potential errors and exceptions that can occur when using the random.sample() function in Python. It is essential to carefully review one’s code to avoid such errors and to use data validation techniques to ensure data integrity in research and data science.

In this article, we have explored the topic of random sampling in Python, including the use of the random.sample() function to select random elements from a list, set, and dictionary. We’ve learned how to use the NumPy library to extract a sample from a multidimensional array.

The article has also emphasized the importance of being aware of potential errors while using the random.sample() function and provided insights on how to avoid them. Random sampling is an essential tool in data science and research, and Python, with its built-in random module and many external libraries, makes it easier to perform statistical analysis efficiently.

Data validation and careful consideration of potential errors remain crucial in data analysis and research.
