Pattern Matching In Large Bodies Of Text Using spaCy’s NLP Capabilities For Data Scientists

Pattern matching can be very useful in large bodies of text. Yes, you could, for the most part, use a regular expression to achieve the same thing, but I find the spaCy functionality a lot more straightforward. After all, nobody enjoys the syntax of a regular expression.
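For comparison, here is a rough sketch of the regular-expression route for a term that can be written several ways. The sample sentence and the `racecar_re` name are made up for this illustration and are not part of the examples that follow.

```python
import re

# Hypothetical sample text, just for this comparison
text = "A racecar, a race-car and a race car all mean the same thing; racing does not."

# "race", then optionally a run of punctuation or a single whitespace character, then "car"
racecar_re = re.compile(r"\brace(?:[^\w\s]+|\s)?car\b", re.IGNORECASE)

regex_hits = [m.group(0) for m in racecar_re.finditer(text)]
print(regex_hits)
```

It works, but the intent is buried in the syntax, which is exactly the part that the spaCy patterns make easier to read.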

Further to this, spaCy makes it easy to understand the context around a term. So, if your company name was mentioned in a document, you’d be able to see the context in which it was mentioned. You could then take that snippet of text and run sentiment analysis and topic modelling scripts on it to get a sense of what people think of your company.

Let’s look at some examples of pattern matching. First, we import the libraries, load a language model and create a matcher:

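import spacy
from spacy.matcher import Matcher

# Load the small English model (install it first: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# The matcher shares the pipeline's vocabulary
matcher = Matcher(nlp.vocab)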

Below, I have defined six patterns. Three of them cover different ways that the word racecar might appear in the document, and the remaining three cover Formula One.

  • Pattern 1 simply checks whether the word ‘racecar’ exists, irrespective of case (upper or lower).
  • Pattern 2 looks for the word race, followed by a single punctuation token, followed by the word car. For example, Race-Car.
  • Pattern 3 looks for race and car as two separate words: ‘race car’.

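Patterns 4 and 6 follow the same structure as patterns 1 and 3. The fifth pattern, however, is slightly different. Here, the middle token has ‘OP’: ‘*’ defined. This tells spaCy to accept any number of punctuation tokens in that position, including none at all. So, depending on how the tokenizer splits the text, we could match something like Formula---one, as well as plain Formula one with nothing in between.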

pattern1 = [{'LOWER': 'racecar'}] 
pattern2 = [{'LOWER': 'race'}, {'IS_PUNCT': True}, {'LOWER': 'car'}] 
pattern3 = [{'LOWER': 'race'}, {'LOWER': 'car'}] 

pattern4 = [{'LOWER': 'formulaone'}] 
pattern5 = [{'LOWER': 'formula'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'one'}] 
pattern6 = [{'LOWER': 'formula'}, {'LOWER': 'one'}] 
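To see what the ‘OP’: ‘*’ quantifier does in isolation, here is a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline with hand-built token lists, so the outcome does not depend on how the tokenizer splits the punctuation. The names `demo_nlp`, `demo_matcher` and `samples` are only for this illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank English pipeline is enough here; no trained model is needed
demo_nlp = spacy.blank("en")
demo_matcher = Matcher(demo_nlp.vocab)
demo_matcher.add("formula_one", [[
    {"LOWER": "formula"},
    {"IS_PUNCT": True, "OP": "*"},  # zero or more punctuation tokens
    {"LOWER": "one"},
]])

# Hand-built token lists (an assumption for the demo, so tokenization is explicit)
samples = [
    ["Formula", "One"],
    ["Formula", "-", "One"],
    ["Formula", "-", "-", "-", "One"],
    ["Formula", "racing", "One"],  # a word in between, so this should not match
]

op_results = []
for words in samples:
    demo_doc = Doc(demo_nlp.vocab, words=words)
    # Keep only the (start, end) token positions of each match
    spans = {(start, end) for _, start, end in demo_matcher(demo_doc)}
    op_results.append(spans)
    print(words, "->", sorted(spans))
```

The first three samples each produce one match covering the whole token list, while the last one produces none because ‘racing’ is not punctuation.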

Now, we add two rules to the matcher. The first is called racecar_matcher and will record any matches for patterns 1, 2 or 3. The second, category_matcher, will record any matches for patterns 4, 5 or 6.

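# spaCy v3 syntax: the patterns for a rule are passed as one list
# (spaCy v2 used matcher.add(name, None, pattern1, pattern2, ...))
matcher.add('racecar_matcher', [pattern1, pattern2, pattern3])
matcher.add('category_matcher', [pattern4, pattern5, pattern6])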

Now, let’s define our document:

doc = nlp(u'The formula one racing in 2020 is been interrupted by covid. Formula---one is the premier racing category with very fast race-car. FormulaOne is a great sport for racecar')

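Now, we can run the matcher! The result is a list of tuples. Each one holds the match ID (a hash of the rule name), followed by the starting position and the ending position, both counted in tokens from zero. For the first ‘formula one’, the match starts at token 1 (formula) and ends at token 3, where the end position is exclusive. So that’s great, we have a list of all the matches for our patterns.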

found_matches = matcher(doc)
print(found_matches)
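To make the shape of a match tuple concrete, here is a small sketch. It assumes spaCy v3 and a blank pipeline, with the tokens listed by hand so the positions are easy to read. The names `demo_nlp`, `demo_matcher` and `decoded` are just for this illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

demo_nlp = spacy.blank("en")  # blank pipeline; no trained model needed for this demo
demo_matcher = Matcher(demo_nlp.vocab)
demo_matcher.add("category_matcher", [[{"LOWER": "formula"}, {"LOWER": "one"}]])

# Token positions: The=0, formula=1, one=2, racing=3
demo_doc = Doc(demo_nlp.vocab, words=["The", "formula", "one", "racing"])

decoded = []
for match_id, start, end in demo_matcher(demo_doc):
    rule_name = demo_nlp.vocab.strings[match_id]  # look the hash up to get the rule name
    decoded.append((rule_name, start, end, demo_doc[start:end].text))

print(decoded)
```

The end position is exclusive, so slicing the document with `start:end` returns exactly the matched tokens.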

But on its own, that isn’t so useful. You want to know exactly which words it matched on, so we can loop through the matches and print them out.

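for match_id, start, end in found_matches:
    rule_name = nlp.vocab.strings[match_id]  # turn the hash back into the rule name
    span = doc[start:end]                    # the matched tokens
    print(rule_name, span.text)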

Now, that’s useful. But let’s get to the real value of spaCy. If, in a very large document, you would like to understand the context in which the term is referenced, you can widen the span. A span is a slice of a document, made up of consecutive tokens. Here, we define that span as 5 tokens before the match and 5 tokens after it, which gives us enough information to understand more of the context around the phrase.

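for match_id, start, end in found_matches:
    # Clamp the window so a match near the start doesn't produce a negative index
    span = doc[max(start - 5, 0) : min(end + 5, len(doc))]
    print(nlp.vocab.strings[match_id], span.text)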
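One thing to watch with a context window is a match near the start of the document: subtracting 5 from a small start position gives a negative index, which Python-style slicing counts from the end. The sketch below shows the effect with a plain list of words standing in for spaCy tokens. The helper name `context_window` is hypothetical, not part of spaCy.

```python
def context_window(tokens, start, end, width=5):
    """Return tokens[start:end] plus up to `width` items on each side."""
    left = max(start - width, 0)            # never go below index 0
    right = min(end + width, len(tokens))   # never run past the end
    return tokens[left:right]

# Whitespace-split words as a rough stand-in for spaCy tokens
words = "The formula one racing in 2020 is been interrupted by covid".split()

naive = words[1 - 5 : 3 + 5]            # start becomes -4, counted from the end
clamped = context_window(words, 1, 3)   # match on "formula one" at positions 1 to 3

print(naive)
print(clamped)
```

Because the helper relies only on `len()` and slicing, it should also work when passed a spaCy `Doc` in place of the list, in which case it returns a span.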

Token patterns work well for cases like these. To search for complete phrases, we can use a phrase matcher instead, like below:

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

doc = nlp(u'The formula one racing in 2020 is been interrupted by covid. Formula---one is the premier racing category with very fast race-car. FormulaOne is a great sport for racecar')
phrase_list = ['formula one racing', 'premier racing']
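# Only tokenize each phrase; the full pipeline isn't needed to build patterns
phrase_patterns = [nlp.make_doc(text) for text in phrase_list]
# spaCy v3 syntax: pass the patterns as one list
matcher.add('matcher', phrase_patterns)
found_matches = matcher(doc)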
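By default the phrase matcher compares the exact token text, so ‘Formula One Racing’ would not match the lowercase phrase in the list. A minimal sketch of a case-insensitive variant follows, assuming spaCy v3 and its `attr` argument. The sample sentence and the `demo_` names are made up for this illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

demo_nlp = spacy.blank("en")  # tokenizer only; no trained model needed for this demo
# attr="LOWER" compares lowercased token text, so matching ignores case
demo_phrase_matcher = PhraseMatcher(demo_nlp.vocab, attr="LOWER")

phrases = ["formula one racing", "premier racing"]
demo_phrase_matcher.add("racing_phrases", [demo_nlp.make_doc(p) for p in phrases])

demo_doc = demo_nlp("Formula One Racing is the PREMIER racing category")

# Sort by start position so the output follows the order in the text
found = sorted(demo_phrase_matcher(demo_doc), key=lambda m: m[1])
matched_phrases = [demo_doc[start:end].text for _, start, end in found]
print(matched_phrases)
```

The matches come back in the same (match ID, start, end) form as before, so the context-window trick works here too.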