Adventures in Machine Learning

How to Count Unique Words in Python: Two Methods Explained

Counting the number of unique words in a String/File in Python

Are you a programmer who frequently finds themselves working with text data? Whether you’re analyzing customer reviews or parsing a document, counting the number of unique words in a string or file can provide valuable insights.

Thankfully, Python makes this task a breeze with its built-in string and file manipulation functions. In this article, we’ll explore two methods for counting unique words in a string or file using Python.

The first method involves using the str.split() function, set(), and len(), while the second method involves implementing a for loop in addition to these functions.

Method 1: Counting unique words in a String/File using str.split(), set(), and len()

The first method is straightforward and efficient.

It splits a string (or the contents of a file) into a list of words, builds a set from that list, which discards duplicates, and then takes the length of the set.

Counting unique words in a String

Let’s begin with an example of counting unique words in a string variable.

```python
sentence = 'This is a sample sentence with several words and some repeated words'

# split() breaks the string on whitespace; set() discards duplicate words
unique_words = set(sentence.split())

print(len(unique_words))
```

The output for this code segment would be 11: the string contains 12 words, and “words” appears twice, which leaves 11 unique words. The above code can be broken down as follows:

- The `sentence` variable contains the string that we want to analyze.
- The `sentence.split()` call converts this string into a list of words.
- The `set()` function creates a set of the unique words in this list.
- The `len()` function returns the number of items in the set.
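One caveat is worth illustrating: `split()` separates on whitespace only, so `Words,`, `words,`, and `WORDS.` are counted as three different words. The following is a minimal sketch, not part of the original method, of one way to normalize words before counting. The helper name `count_unique_normalized` and the decision to lowercase and strip surrounding punctuation are assumptions made for illustration; other definitions of a "word" are equally valid.

```python
import string


def count_unique_normalized(text):
    """Count unique words, ignoring case and surrounding punctuation.

    Hypothetical helper for illustration only.
    """
    normalized = set()
    for word in text.split():
        # Strip punctuation from both ends of the token, then lowercase it
        cleaned = word.strip(string.punctuation).lower()
        if cleaned:  # skip tokens that consisted of punctuation only
            normalized.add(cleaned)
    return len(normalized)


sample = 'Words, words, and more WORDS.'
print(len(set(sample.split())))         # 5 with a plain whitespace split
print(count_unique_normalized(sample))  # 3 after normalization
```

Whether normalization is appropriate depends on the analysis: for case-sensitive data such as identifiers, the plain `split()` version may be the right one.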

Counting unique words in a text File

This method is also applicable when analyzing text data saved in a file. Here’s how you would go about counting unique words in a text file.

```python
with open('mytextfile.txt', 'r') as f:
    text = f.read()  # read the whole file into a single string

# Split on whitespace and keep one copy of each word
unique_words = set(text.split())

print(len(unique_words))
```

The above code segment:

- Opens the file “mytextfile.txt” in read mode and assigns it to the variable `f`.
- Reads the contents of the file `f` and assigns it to the variable `text`.
- `text.split()` converts this string into a list of words.
- `set()` creates a set of the unique words in this list.
- `len()` returns the number of items in the set.
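The same steps can be wrapped in a small reusable function. This is a sketch under a few assumptions: the function name `count_unique_words_in_file` is hypothetical, the file is assumed to be UTF-8 encoded, and a temporary file stands in for `mytextfile.txt` so that the example runs on its own.

```python
import os
import tempfile


def count_unique_words_in_file(path, encoding='utf-8'):
    """Return the number of unique whitespace-separated words in a file."""
    with open(path, 'r', encoding=encoding) as f:
        text = f.read()
    return len(set(text.split()))


# Write a small temporary file to stand in for a real text file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('one two two\nthree one\n')
    tmp_path = tmp.name

print(count_unique_words_in_file(tmp_path))  # 3
os.remove(tmp_path)  # clean up the temporary file
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between systems.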

Method 2: Counting unique words in a String/File using a for loop

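The first method is usually the better choice for performance, because sets are designed for fast membership checks. Still, Python is a versatile language that provides many approaches to a problem.

An explicit for loop is one such approach. It is useful when you want to see each step, keep the words in the order they first appear, or add per-word logic such as filtering. Be aware that checking whether a word is already in a list scans the list each time, so this version slows down as the number of unique words grows.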

Counting unique words in a String using a for loop

In this method, we start by splitting the string into a list of words and then iterate through each word with a for loop. For each word, we check whether it is already in a separate list of words seen so far.

If it isn’t, we add it to that list. Finally, we take the length of the list to determine the number of unique words.

```python
sentence = 'This is a sample sentence with several words and some repeated words'

unique_words = []  # words seen so far, in order of first appearance

for word in sentence.split():
    # Record a word only the first time it appears
    if word not in unique_words:
        unique_words.append(word)

print(len(unique_words))
```

The output would be 11 because there are 11 unique words in the string variable `sentence`: of its 12 words, only “words” is repeated.
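When the order of first appearance matters but the text is long, one common variation keeps a set alongside the list so that the membership check stays fast. The following is a minimal sketch; the function name `unique_words_in_order` and the `seen` variable are illustrative choices rather than part of the code above.

```python
def unique_words_in_order(text):
    """Return the unique words of `text` in the order they first appear."""
    seen = set()   # used only for fast membership checks
    ordered = []   # preserves first-appearance order
    for word in text.split():
        if word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered


result = unique_words_in_order('the cat saw the other cat')
print(result)       # ['the', 'cat', 'saw', 'other']
print(len(result))  # 4
```

On Python 3.7 and later, `list(dict.fromkeys(text.split()))` should give the same ordered result in one line, since dictionaries preserve insertion order.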

Counting unique words in a text File using a for loop

To count unique words in a text file using the for loop method, we’ll follow the same steps as above.

```python
with open('mytextfile.txt', 'r') as f:
    text = f.read()  # read the whole file into a single string

unique_words = []  # words seen so far, in order of first appearance

for word in text.split():
    # Record a word only the first time it appears
    if word not in unique_words:
        unique_words.append(word)

print(len(unique_words))
```

The above code segment:

- Opens the file “mytextfile.txt” in read mode and assigns it to the variable `f`.
- Reads the contents of the file `f` and assigns it to the variable `text`.
- Splits the `text` variable into separate words using the `text.split()` function.
- Iterates through each word using a for loop.
- Checks if the word has already appeared in the `unique_words` list.
- If the word hasn’t appeared, it’s added to the `unique_words` list.
- Finally, the length of the list is counted using the `len()` function to determine the number of unique words.
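For files too large to read comfortably into memory at once, the loop can run over the file line by line instead. The sketch below is one possible variation, not the code shown above: it assumes a UTF-8 text file, uses a hypothetical function name `count_unique_words_streaming`, and collects words in a set rather than a list. A temporary file is created so that the example is self-contained.

```python
import os
import tempfile


def count_unique_words_streaming(path, encoding='utf-8'):
    """Count unique words without reading the whole file into one string."""
    unique_words = set()
    with open(path, 'r', encoding=encoding) as f:
        for line in f:  # file objects yield one line at a time
            unique_words.update(line.split())
    return len(unique_words)


# Write a small temporary file to stand in for a large text file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('red green\nblue red\n\ngreen\n')
    tmp_path = tmp.name

print(count_unique_words_streaming(tmp_path))  # 3
os.remove(tmp_path)  # clean up the temporary file
```

Because `split()` treats newlines as whitespace, processing the file one line at a time yields the same count as splitting the full text in a single call.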

Conclusion

Counting the number of unique words in a string or file is a simple yet powerful technique that can provide valuable insights into text data. Python’s string and file handling functions make it easy to parse and analyze text.

This article covered two approaches. The first combines str.split(), set(), and len() and is the more efficient of the two; the second uses a for loop to build a list of unique words step by step. Both are simple to understand and implement, and either can serve as a starting point for your own text analysis.