Introduction to NLP and spaCy
Have you ever wondered how machines can communicate with us in our own language, much as another person would? This is where Natural Language Processing (NLP) comes in.
NLP involves the use of computational techniques to enable communication between machines and humans using natural language. In simpler terms, NLP equips computers with the ability to understand, interpret, and generate human language.
One powerful tool used in NLP is spaCy, a free, open-source library designed for tasks such as information extraction, text classification, and natural language understanding. SpaCy is written in Python (with performance-critical parts in Cython), and it offers advantages over many other NLP tools, including speed, ease of use, and efficient memory usage.
In this article, we will explore the basics of NLP, and delve into installing and using spaCy.
What is NLP?
NLP is a subfield of artificial intelligence (AI) that focuses on the interaction between human language and computers.
NLP is what enables machines to read text and understand its meaning. Likewise, it also allows machines to interpret spoken language and generate appropriate responses.
NLP has numerous applications such as:
- Sentiment analysis: Can be used to determine the polarity of text, i.e., whether it is positive, negative, or neutral. Businesses can use sentiment analysis to evaluate customer feedback and tailor their products to customer needs.
- Machine translation: Allows translation of text from one language to another.
- Chatbots: Can be used to offer online and mobile customer services by addressing customer queries in a human-like manner.
- Speech recognition: Speech recognition technology is essential for virtual assistants such as Amazon Alexa and Apple Siri.
- Text summarization: Can be used to extract key information from long pieces of text.
What is spaCy and Its Capabilities?
SpaCy is an open-source software library for advanced natural language processing tasks.
It was developed by the team at Explosion AI, with its first version released in 2015. SpaCy comes equipped with pre-trained models for several languages, including English.
It can be used for a whole range of NLP tasks, including:
- Tokenization: Segmenting text into individual tokens, such as words and punctuation marks.
- Named entity recognition (NER): Identifying entities such as people, organizations, and locations in a text.
- Part-Of-Speech (POS) tagging: Identifying the parts of speech of every word in a sentence.
- Dependency parsing: Identifying the grammatical relationships between the words in a sentence.
- Lemmatization: Attempting to reduce words to their base forms (e.g., runs to run).
- Text classification: Assigning categories to text according to its content.
Installation of spaCy
Now that we have a better understanding of spaCy and the benefits it offers, let’s walk through the installation process.
Step 1: Install spaCy
Open up your command-line interface (CLI) and type the command below to install spaCy.
pip install spacy
Step 2: Download the Language Model
After installing spaCy, the next step is to download a language model that it can use. SpaCy has several pre-trained models for different languages and tasks.
For English, type the following command to download the model.
python -m spacy download en_core_web_sm
This command downloads a small English language model. If you need a larger English model, such as the medium model, which also includes word vectors, you can use the command below.
python -m spacy download en_core_web_md
Step 3: Test the Installation
To confirm that spaCy is installed correctly, execute the following code in a Python environment.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a test sentence.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
This code loads the pre-trained model, processes a sample sentence, and outputs the tokens (words) of the sentence with their associated POS tags and dependency relationships.
Conclusion
In conclusion, spaCy is a powerful tool for natural language processing, with vast capabilities and a user-friendly design. We hope that this article has equipped you with a basic understanding of NLP and of how to install spaCy. Now you can begin exploring spaCy's functionality and creating your own NLP applications.
Enjoy!
3) The Doc Object for Processed Text
To get started with text processing in spaCy, the first step is to load a language model instance. In spaCy, a language model is a trained pipeline that bundles the vocabulary, language-specific rules, and statistical components needed to analyze text.
This model is necessary for text processing tasks such as POS tagging, sentence segmentation, and named entity recognition. SpaCy provides several language models that you can download and use.
There are models for different languages, and for many languages they come in small, medium, and large sizes. A typical way to load a language model instance in spaCy is as follows:
import spacy
nlp = spacy.load('en_core_web_sm')
Here, `en_core_web_sm` is the name of the English language model we are loading. The `spacy.load` call returns a language model instance, which we assign to a variable conventionally called `nlp`.
With the language model instance loaded, we can now instantiate a Doc object to process an input string of text. A Doc object is a container for accessing linguistic annotations that are generated by the language processing pipeline.
doc = nlp('This is a sentence.')
Here, we create a Doc object called `doc` by passing a string of text to the nlp instance. This command runs the text through the processing pipeline of the loaded language model and generates the desired linguistic annotations, such as POS tagging and sentence segmentation.
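To see what the loaded pipeline computes, we can inspect its component names and then look at a few of the resulting annotations; for example, named entities are exposed through `doc.ents`. Here is a minimal sketch (the sample sentence is just an illustrative input):
# List the components in the loaded processing pipeline (e.g. tagger, parser, ner)
print(nlp.pipe_names)
# Print the named entities detected by the pipeline, with their labels
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)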
4) Sentence Detection
Sentence detection, also known as sentence segmentation or sentence boundary detection, is the task of locating where sentences begin and end in a piece of text. This task is essential, as many downstream NLP tasks require input that is already segmented into sentences.
SpaCy performs sentence segmentation as part of its language processing pipeline; in the default English pipelines, the sentence boundaries come from the dependency parser. To perform sentence detection with spaCy, we can simply access the `sents` attribute of a Doc object, which yields a span for each sentence in the document.
doc = nlp('This is the first sentence. This is the second sentence.')
for sent in doc.sents:
    print(sent)
In this example, the `sents` attribute of our `doc` object gives us an iterable of the two sentences in our input text.
We can now iterate over these spans and perform further processing on each sentence separately. The default sentence boundaries work well for most text.
However, there may be cases where they are not what we want. In such cases, we can customize the sentence detection behavior of the pipeline by adding a custom component with our own rules.
For example, we can treat an ellipsis (`...`) as a sentence boundary by marking the token that follows it as the start of a new sentence. The example below uses spaCy v3's `@Language.component` API.
from spacy.language import Language
@Language.component('ellipsis_sentence_boundaries')
def ellipsis_sentence_boundaries(doc):
    '''Treat an ellipsis ('...') as the end of a sentence.'''
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc
nlp.add_pipe('ellipsis_sentence_boundaries', before='parser')
In this example, we define a custom pipeline component that looks for ellipses (`...`) and marks the token following each one as the start of a new sentence.
We register the component with the `@Language.component` decorator and add it to the existing pipeline before the parser, so that the parser respects the boundaries we set. Note that the component receives a Doc object as input and must return the (modified) Doc.
To set the boundaries, we assign `is_sent_start = True` on the appropriate tokens; the `sents` attribute of the processed Doc then reflects these custom splits.
Conclusion
In this article, we have covered the basics of loading a language model instance and instantiating a Doc object to process input text in spaCy. We have also discussed the importance of correctly detecting sentence boundaries in text and how spaCy provides a pre-trained sentence segmentation model as part of its language processing pipeline. Finally, we have seen how to customize sentence detection behavior using custom delimiters, allowing for greater flexibility in the pipeline.
With this knowledge, you can start exploring the full capabilities of spaCy and creating custom pipelines to meet your specific NLP needs.
5) Tokens in spaCy
Tokenization is the process of splitting text into individual words or tokens. A token is an atomic unit of meaning, which can be a word, punctuation mark, or number.
SpaCy uses a rule-based tokenization scheme that efficiently splits text into usable tokens. In spaCy, the Token object represents a single token, and it exposes various linguistic attributes, such as the token's text, lemma, and part-of-speech (POS) tag, through its properties.
To access token attributes in spaCy, first, we need to create a Doc object for our text input. We can access the individual tokens in the Doc object by iterating over the `doc` object as shown below.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is a test sentence.')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
Here, we iterate over all the tokens in the Doc object. For each token, we print out its text, lemma, POS tag, and dependency label.
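Token objects expose many more convenience attributes than the ones printed above; for instance, boolean flags that tell us whether a token is alphabetic, punctuation, or number-like, and a `shape_` string describing its orthography. A quick sketch using the same `doc`:
for token in doc:
    # is_alpha/is_punct/like_num are boolean flags; shape_ abstracts the spelling (e.g. 'Xxxx')
    print(token.text, token.is_alpha, token.is_punct, token.like_num, token.shape_)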
We can also customize how text is split into tokens. SpaCy does not simply split on whitespace; it first splits the input on whitespace and then applies language-specific prefix, suffix, and infix rules along with special-case exceptions. For English, this means that a hyphenated word such as "quick-witted" is split on its hyphen into three tokens by default.
We can change this behavior by adding rules to the tokenizer. In the following example, we add special cases so that certain hyphenated words are kept together as single tokens.
import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_core_web_sm')
# Hyphenated words we want the tokenizer to keep together as single tokens
hyphen_exceptions = ["quick-witted", "self-motivated"]
for exc in hyphen_exceptions:
    # Register a special-case rule that maps the string to exactly one token
    nlp.tokenizer.add_special_case(exc, [{ORTH: exc}])
doc = nlp("This is a quick-witted person, and he is self-motivated.")
for token in doc:
    print(token.text)
Here, we use `nlp.tokenizer.add_special_case` to register a rule for each word in the `hyphen_exceptions` list. Each rule maps the full string to a single token whose `ORTH` (verbatim text) is the word itself.
When we process the sample sentence, "quick-witted" and "self-motivated" now appear as single tokens instead of being split on their hyphens.
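When a tokenization result is surprising, the `nlp.tokenizer.explain()` method shows which rule produced each token, which is handy while developing custom rules. A quick sketch:
# Each entry pairs the rule that fired (e.g. TOKEN, SUFFIX, SPECIAL-1) with the resulting token text
for rule, token_text in nlp.tokenizer.explain("This is self-motivated."):
    print(rule, token_text)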
6) Stop Words
Stop words are words that occur very frequently in a language but carry little meaning on their own. They are usually removed from text to reduce computational complexity and noise in the data.
In English, stop words include words such as “a,” “an,” “the,” “is,” and “and.”
Luckily, spaCy provides built-in support for stop word removal. To remove stop words from a string of text, we can use spaCy's `is_stop` attribute on each token in a sentence.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is a test sentence.')
filtered_text = [token.text for token in doc if not token.is_stop]
Here, we iterate through each token in the Doc object and filter out stop words using the `is_stop` attribute. The `filtered_text` variable contains a list of non-stop words.
To customize stop words, spaCy provides a simple way to adjust the stop word set for a given language. We can inspect the default stop words through the `Defaults.stop_words` attribute of the loaded pipeline (the same set is also available as `STOP_WORDS` in the `spacy.lang.en.stop_words` module):
import spacy
nlp = spacy.load('en_core_web_sm')
stop_words = nlp.Defaults.stop_words
print(stop_words)
This produces the set of default stop words used by the loaded language model. By adding words to or removing words from this set, we can customize the stop word list for our needs.
We can do this by simply manipulating the set object:
import spacy
nlp = spacy.load('en_core_web_sm')
custom_stop_words = {'hello', 'world'}
nlp.Defaults.stop_words |= custom_stop_words
# Also flag the lexemes directly so their is_stop attribute reflects the change
for word in custom_stop_words:
    nlp.vocab[word].is_stop = True
doc = nlp('hello world, this is a test sentence.')
filtered_text = [token.text for token in doc if not token.is_stop]
print(filtered_text)
In this example, we add 'hello' and 'world' to the stop word set and also flag the corresponding lexemes so that their `is_stop` attribute picks up the change. We then generate a Doc object from a sample input string, filter out stop words, and print the filtered result. Note that `is_stop` is a per-lexeme flag, so capitalized variants such as 'Hello' are separate lexemes; you may need to flag them as well, or compare `token.lower_` against the stop word set instead.
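Removing a word from the stop list works the same way in reverse. For example, to stop treating 'no' as a stop word, a minimal sketch (`discard` is a no-op if the word is not in the set):
# Remove 'no' from the stop word set and clear its lexeme flag
nlp.Defaults.stop_words.discard('no')
nlp.vocab['no'].is_stop = False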
Conclusion
Tokenization and stop words are both essential tools for effective NLP. In this article, we covered tokenization in spaCy and learned about tokenization customization to suit our specific needs.
We also covered stop words in the English language and how spaCy provides support for stop words removal. With this knowledge, you can add more advanced processing capabilities to your NLP pipelines and extract more critical insights from your data.
7) Lemmatization
Lemmatization is the process of reducing the inflected forms of a word to their base or root form (the lemma). It is an essential pre-processing step in natural language processing that reduces the complexity of text data by grouping together the different inflected forms of the same word.
SpaCy provides support for lemmatization as part of its overall language processing pipeline.
To see how lemmatization works in spaCy, we can use an example string with different forms of the word “organize”:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("organizes organizing organized")
for token in doc:
    print(token.text, token.lemma_)
This code prints the tokenized text along with its corresponding lemma. We can see that spaCy correctly lemmatizes the various forms of the verb “organize” to its base form “organize”.
Lemmatization reduces the number of tokens in the text and makes it easier to analyze the data. SpaCy uses a pre-configured set of lemmatization rules that are available in the language model, so we don’t need to specify them manually.
SpaCy also takes into account the context in which the word is used when determining its correct lemma.
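To illustrate, the same surface form can receive different lemmas depending on the part of speech the model assigns. A small sketch (exact tags and lemmas may vary between model versions):
doc = nlp("I was meeting her at the annual meeting.")
for token in doc:
    # Print each token with its predicted POS tag and lemma
    print(token.text, token.pos_, token.lemma_)
With the small English model, the first "meeting" is typically tagged as a verb and lemmatized to "meet", while the second is tagged as a noun and left as "meeting".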
8) Word Frequency
Word frequency is an essential measure in text analysis that determines how often each word appears in a piece of text. The frequency of a word can be a valuable insight into the content of the text and can be useful in several NLP tasks, such as topic modeling, keyword extraction, and text summarization.
To determine the word frequency in spaCy, we can create a frequency distribution of words in a given text using Python’s Counter class. Let’s see an example below:
import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a test sentence. This is another test sentence.")
word_frequencies = Counter()
for token in doc:
    # Count lemmas, skipping stop words and punctuation
    if not token.is_stop and not token.is_punct:
        word_frequencies[token.lemma_] += 1
print(word_frequencies)
Here, we iterate through each token in the Doc object, skip stop words and punctuation, and increment the count for the corresponding lemma in the Counter object. The Counter object stores the frequency of each remaining word in the text.
This approach calculates the frequency of each content word in the text, excluding stop words and punctuation, and provides valuable insight into the text's content.
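Since `Counter` is a standard Python class, we can also ask directly for the most frequent lemmas:
# Print the two most common lemmas with their counts
print(word_frequencies.most_common(2))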
Conclusion
In this article, we have covered various aspects of text processing with spaCy, including loading language models, creating Doc objects, sentence detection, tokenization, stop words removal, lemmatization, and word frequency analysis. These capabilities are essential for effective NLP, and with spaCy, you can easily implement these techniques and create sophisticated text processing pipelines.
With this knowledge, you can start exploring the full capabilities of spaCy and building custom pipelines for your specific NLP needs. Keep exploring and enjoy the world of natural language processing!