Adventures in Machine Learning

How to Count Unique Words in Python: Two Methods Explained

Counting the number of unique words in a String/File in Python

Are you a programmer who frequently finds themselves working with text data? Whether you’re analyzing customer reviews or parsing a document, counting the number of unique words in a string or file can provide valuable insights.

Thankfully, Python makes this task a breeze with its built-in string and file manipulation functions. In this article, we’ll explore two methods for counting unique words in a string or file using Python.

Method 1: Counting unique words in a String/File using str.split(), set(), and len()

The first method is straightforward and efficient: building a set takes roughly linear time in the number of words.

It involves converting a string or a file into a list of words, creating a set of unique words, and then counting the length of the set.

Counting unique words in a String

Let’s begin with an example of counting unique words in a string variable.

sentence = 'This is a sample sentence with several words and some repeated words'
unique_words = set(sentence.split())
print(len(unique_words))

The output of this code is 11: the sentence contains 12 words, and “words” appears twice, which leaves 11 unique words. The code can be broken down as follows:

  • The sentence variable contains the string that we want to analyze.
  • The sentence.split() method splits this string on whitespace and returns a list of words.
  • The set() function builds a set from this list, which discards duplicate words.
  • The len() function returns the number of items in the set.
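
One caveat: split() only separates on whitespace and the comparison is case-sensitive, so “Words”, “words,” and “WORDS!” count as three different entries. Below is a minimal sketch of one way to normalize the text first. It assumes that lowercasing and keeping only runs of ASCII letters and apostrophes is an acceptable definition of a word; the function name count_unique_words and the regular expression are illustrative choices, not part of the original example.

```python
import re

def count_unique_words(text):
    """Count distinct words, ignoring case and attached punctuation."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words))

sample = "Words matter. The words, the WORDS!"

naive_count = len(set(sample.split()))
normalized_count = count_unique_words(sample)

print(naive_count)       # 6: case and punctuation make every token distinct
print(normalized_count)  # 3: words, matter, the
```

Whether this normalization is appropriate depends on the data; text with accented characters or hyphenated terms would likely need a different pattern.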

Counting unique words in a text File

This method is also applicable when analyzing text data saved in a file. Here’s how you would go about counting unique words in a text file.

# Assumes the file is UTF-8 encoded; adjust the encoding if yours differs.
with open('mytextfile.txt', 'r', encoding='utf-8') as f:
    text = f.read()
unique_words = set(text.split())
print(len(unique_words))

The above code segment:

  • Opens the file “mytextfile.txt” in read mode and binds the file object to the variable f; the with statement closes the file automatically afterwards.
  • Reads the contents of the file f and assigns it to the variable text.
  • text.split() converts this string into a list of words.
  • set() creates a set of the unique words in this list.
  • len() returns the number of items in the set.
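
To try the steps above without an existing mytextfile.txt, the following sketch wraps them in a function and runs it against a small temporary file. The function name count_unique_words_in_file, the UTF-8 default, and the sample text are assumptions made for illustration.

```python
import os
import tempfile

def count_unique_words_in_file(path, encoding="utf-8"):
    """Return the number of distinct whitespace-separated words in a file."""
    with open(path, "r", encoding=encoding) as f:
        return len(set(f.read().split()))

# Write a small throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the cat sat on the mat\nthe dog sat too\n")
    tmp_path = tmp.name

file_count = count_unique_words_in_file(tmp_path)
print(file_count)  # 7: the, cat, sat, on, mat, dog, too

os.remove(tmp_path)  # clean up the temporary file
```

Because split() with no arguments treats newlines like any other whitespace, words on different lines are handled the same way as words on one line.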

Method 2: Counting unique words in a String/File using a for loop

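The first method is concise, but it hides the individual steps and gives you only the final result.

Python offers more than one approach to most problems, and an explicit for loop is one of them. Writing the loop out makes each step visible and leaves room for extra logic, such as skipping certain words or preserving the order in which words first appear. Note that the list-based version shown here is slower than the set-based one on large inputs, because checking whether a word is in a list scans the list from the start each time.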

Counting unique words in a String using a for loop

In this method, we split the string into a list of words and iterate over it with a for loop. For each word, we check whether it is already in a separate list of the words seen so far.

If it isn’t, we append it to that list. Finally, the length of the list gives the number of unique words.

sentence = 'This is a sample sentence with several words and some repeated words'
unique_words = []
for word in sentence.split():
    if word not in unique_words:
        unique_words.append(word)
print(len(unique_words))

The output would be 11, the same result as with the first method, because there are 11 unique words in the string variable sentence.
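
The membership test against a list scans the list on every iteration, so the loop slows down as the number of unique words grows. A possible variation, sketched here with the hypothetical names seen and unique_in_order, keeps a set alongside the list: the set gives fast membership checks, while the list preserves the order in which words first appear.

```python
sentence = 'This is a sample sentence with several words and some repeated words'

seen = set()          # fast "have we seen this word?" checks
unique_in_order = []  # unique words in first-occurrence order

for word in sentence.split():
    if word not in seen:
        seen.add(word)
        unique_in_order.append(word)

print(len(unique_in_order))
print(unique_in_order)
```

If the order of the words does not matter, the set alone is enough and the list can be dropped.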

Counting unique words in a text File using a for loop

To count unique words in a text file with a for loop, we follow the same steps as above, reading the text from the file first.

# Assumes the file is UTF-8 encoded; adjust the encoding if yours differs.
with open('mytextfile.txt', 'r', encoding='utf-8') as f:
    text = f.read()
unique_words = []
for word in text.split():
    if word not in unique_words:
        unique_words.append(word)
print(len(unique_words))

The above code segment:

  • Opens the file “mytextfile.txt” in read mode and binds the file object to the variable f; the with statement closes the file automatically afterwards.
  • Reads the contents of the file f and assigns it to the variable text.
  • Splits the text variable into separate words using the text.split() function.
  • Iterates through each word using a for loop.
  • Checks if the word has already appeared in the unique_words list.
  • If the word hasn’t appeared, it’s added to the unique_words list.
  • Finally, the length of the list is counted using the len() function to determine the number of unique words.
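
For files too large to read comfortably in one call, a for loop can also iterate over the file line by line, so only one line is held in memory at a time. The sketch below is one way to do that; the function name count_unique_words_streaming, the UTF-8 default, and the sample text are assumptions for illustration, and it collects words in a set rather than a list to keep membership checks fast.

```python
import os
import tempfile

def count_unique_words_streaming(path, encoding="utf-8"):
    """Count distinct whitespace-separated words, reading one line at a time."""
    unique_words = set()
    with open(path, "r", encoding=encoding) as f:
        for line in f:                         # the file object yields lines lazily
            unique_words.update(line.split())  # add every word on this line
    return len(unique_words)

# Write a small throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a b c\nb c d\n\n")
    tmp_path = tmp.name

streaming_count = count_unique_words_streaming(tmp_path)
print(streaming_count)  # 4: a, b, c, d

os.remove(tmp_path)  # clean up the temporary file
```

The set of unique words still has to fit in memory, but that is usually far smaller than the full text.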

Conclusion

Counting the number of unique words in a string or file is a simple technique and a common first step in analyzing text data. This article covered two ways to do it in Python: combining str.split(), set(), and len(), which is concise and fast, and using an explicit for loop, which is longer but easy to extend with extra logic.

Both approaches split on whitespace and compare words exactly, so capitalization and attached punctuation affect the count. With that in mind, either method is easy to understand and implement, and both can help you gain insights from text data.
