NLP Series: The Basic Building Blocks of NLP

Recently, I have been presented with some problems that require Natural Language Processing to solve. Natural Language Processing (NLP) is a field of Artificial Intelligence that enables a machine to interpret human text for the purpose of analysis, for example understanding the main topics discussed in a document or how positive customer reviews are about your product.

In this article, I want to cover the basic building blocks of NLP, which we will then build upon in later articles.

An Intro To spaCy

spaCy is a comprehensive, high-performance NLP library that is heavily used in the Python community. We’ll discuss some of its capabilities throughout this article. To install it, run the commands below (the old 'en' shortcut no longer works in spaCy v3, so we download the model by its full name):

pip install spacy
python3 -m spacy download en_core_web_sm

At the very top of your code, include the two lines below, which import the library and load the small English model (en_core_web_sm) for it to use.

import spacy
nlp = spacy.load('en_core_web_sm')

Part Of Speech (POS)

Exploring text with spaCy is simple and quite enjoyable. Below, I define a new doc, print its text, access the token at index 4 (the fifth token, as indexing starts at zero) and then look at its pos_ and tag_ attributes.

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'This looks like a job for me, so everybody just follow me')

print(doc.text)
print(doc[4])       # the token at index 4 ('job')
print(doc[4].pos_)  # coarse part of speech, e.g. NOUN
print(doc[4].tag_)  # fine-grained tag, e.g. NN (singular noun)

Here, the POS tag tells us the part of speech (e.g. is it a verb, a noun, etc.). If we use token.pos, we get the integer ID of the part of speech, while token.pos_ gives us its name. The tag gives us a more granular view: a lookup table is available in the spaCy documentation, and spacy.explain() returns a short description of any tag.

Let’s run this for the whole doc:

for token in doc:
    print(f'{token.text:{10}}{token.pos_:{10}}{token.tag_:{10}}{spacy.explain(token.tag_):{10}}')


spaCy can also determine tense where the same word form is used for more than one tense. The English below is pretty terrible, but it makes the point: you can see that the word read has been classed as ‘present’ tense.

doc = nlp(u'I am read books')

for token in doc:
    print(f'{token.text:{10}}{token.pos_:{10}}{token.tag_:{10}}{spacy.explain(token.tag_):{10}}')


Below, it’s been classed as past tense:

doc = nlp(u'I read a book')
for token in doc:
    print(f'{token.text:{10}}{token.pos_:{10}}{token.tag_:{10}}{spacy.explain(token.tag_):{10}}')


Alright, so what if we wanted to count the number of nouns, verbs, etc. in a given document? We can do that using the code below.

doc = nlp(u'This looks like a job for me so everybody just follow me')
pos_count = doc.count_by(spacy.attrs.POS)

# count_by returns integer POS IDs; to get a readable name, we look each ID up in the vocab
for key, value in pos_count.items():
    print(f'{key:{10}}, {value:{10}}, {doc.vocab[key].text:{10}}')
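
If you would rather skip the ID lookup, one option is to count the readable names directly. The sketch below is only an illustration, not part of spaCy: count_pos_names is a hypothetical helper that works on anything exposing a pos_ attribute, and the hand-tagged stand-in tokens are assumptions so the snippet runs without a model. With a real doc, you would pass the doc itself.

```python
from collections import Counter
from types import SimpleNamespace


def count_pos_names(tokens):
    """Count tokens by their readable part-of-speech name (the pos_ attribute)."""
    return Counter(token.pos_ for token in tokens)


# Hand-tagged stand-ins for spaCy tokens (assumed tags, for illustration only).
tokens = [
    SimpleNamespace(text="everybody", pos_="PRON"),
    SimpleNamespace(text="just", pos_="ADV"),
    SimpleNamespace(text="follow", pos_="VERB"),
    SimpleNamespace(text="me", pos_="PRON"),
]

pos_names = count_pos_names(tokens)
for name, count in pos_names.most_common():
    print(f"{name:{10}}{count:{5}}")
```

Because the keys are already names, there is nothing to look up afterwards.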

You can visualize the parts of speech, along with the dependencies between the words, like below:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy
doc = nlp(u'This looks like a job for me so everybody just follow me')

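# 'compact' expects a boolean rather than the string 'True'; in a Jupyter notebook this draws the diagram inline
displacy.render(doc, style='dep', options={'compact': True, 'color': 'yellow', 'bg': 'black'})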

Sentence Detection

spaCy can automatically split your document into sentences too. You can then loop through the sentences using the article.sents syntax below.

article = nlp(u'As data engineers and data scientists, we’re spend a lot of time exploring data. When you’re working with huge datasets, you may find you need to utilise Apache Spark or similar to conduct this exploratative analysis but for the majority of use-cases, Pandas is the defaqto tool we choose.\nThe Pandas library has been so heavily adopted because of its simplicity – exploring your data is almost enjoyable as the library is intuitive and easy to use. However, sometimes, just sometimes, it does become a bit tedious. When you’re looking at relationships between many fields; when you’re filtering the dataframe time and time again to look at the data from different angles and when you’re creating tonnes of visualizations in MatplotLib or Seaborn – that is time consuming and a bit frustrating')

for sentence in article.sents:
    print(sentence)

We can then check a boolean attribute on a specific token within the article to determine whether or not it starts a sentence.

article[17].is_sent_start

Sometimes, we don’t want the default sentence detection rules implemented by spaCy, so we can write our own function. Below, we force a sentence break after each semicolon. We add the function to our NLP pipeline as a component, so when we run nlp on the doc, it marks those breaks for us. The registration syntax shown is the one used by spaCy v3 and later.

import spacy
from spacy.language import Language

nlp = spacy.load('en_core_web_sm')
doc = nlp('"sentence 1; sentence 2" - sentence 3.')

@Language.component('set_custom_rules') #spaCy v3+: register the function under a name
def set_custom_rules(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True #next index position starts sentence
    return doc

nlp.add_pipe('set_custom_rules', before='parser') #add the component by its registered name

doc = nlp('"sentence 1; sentence 2" - sentence 3.')
for sent in doc.sents:
    print(sent)
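
To see the rule on its own, here is a rough sketch in plain Python. It is not how spaCy works internally: sentence_start_indices and split_on_starts are hypothetical helpers, and the token list is hand-written to resemble the example sentence. It shows why the loop skips the last token (there is nothing after it to mark) and how the start flags turn into sentences.

```python
def sentence_start_indices(tokens, delimiter=";"):
    """Return the index of every token that opens a sentence."""
    starts = [0]  # the first token always opens a sentence
    # Skip the last token: nothing follows it, so tokens[i + 1] would not exist.
    for i, token in enumerate(tokens[:-1]):
        if token == delimiter:
            starts.append(i + 1)
    return starts


def split_on_starts(tokens, starts):
    """Cut the token list into sentences at the given start indices."""
    bounds = starts + [len(tokens)]
    return [tokens[a:b] for a, b in zip(bounds, bounds[1:])]


# Hand-written tokens (an assumption, roughly how the example would be split).
tokens = ['"', "sentence", "1", ";", "sentence", "2", '"', "-", "sentence", "3", "."]
starts = sentence_start_indices(tokens)
sentences = split_on_starts(tokens, starts)
for sentence in sentences:
    print(" ".join(sentence))
```

In the real pipeline, the parser can still add boundaries of its own on top of the ones we set, so the spaCy output may contain more sentences than this sketch does.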

Named Entity Recognition

If that weren’t useful enough, the library also automatically identifies ‘entities’, i.e. organizations, locations, monetary amounts and so on. As you can see below, it has labelled London as a geopolitical entity (country, city or state) and 2999 as a monetary value. A lookup table for the entity labels is available in the spaCy documentation.

doc3 = nlp(u'CompanyX are releasing the new Phone soon in London it will be £2999')

for entity in doc3.ents: 
    print(entity)
    print(entity.label_) 
    print(str(spacy.explain(entity.label_)))

Let’s take this a step further. Below, I have written a little function to print the entity data nicely and have defined a new document. Let’s see what entities spaCy identifies automatically.

def show_entities(doc):
    if doc.ents:
        for ent in doc.ents:
            # str() avoids a formatting error if spacy.explain() returns None for an unknown label
            print(f'{ent.text:{20}} {ent.label_:{20}} {str(spacy.explain(ent.label_)):{10}}')
    else:
        print('no entities found')
        
doc = nlp(u'I live in London, UK, kdeycouk has an HQ here, people can work from home until next April. It costs 30 pounds to get the train to london')

show_entities(doc)

We can add ‘kdeycouk’ as a new entity with the organization (ORG) label:

from spacy.tokens import Span
ORG = doc.vocab.strings[u'ORG']
new_entity = Span(doc, 7, 8, label=ORG) #token 7 only; the end index is exclusive
doc.ents = list(doc.ents) + [new_entity]

show_entities(doc)
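
The 7 and 8 passed to Span are token positions, not character positions, and the end is exclusive. If you don't want to count tokens by hand, a small search helps. This is a sketch under assumptions: find_token_span is a hypothetical helper, and the token list is hand-written to mirror the start of the sentence above. With a real doc you could build the list with [token.text for token in doc].

```python
def find_token_span(tokens, phrase):
    """Return (start, end) for the first match of phrase, with end exclusive."""
    width = len(phrase)
    for start in range(len(tokens) - width + 1):
        if tokens[start:start + width] == phrase:
            return start, start + width
    return None  # the phrase does not occur in the token list


# Hand-written token list (assumed to mirror how the sentence is tokenized).
tokens = ["I", "live", "in", "London", ",", "UK", ",", "kdeycouk", "has", "an", "HQ", "here"]
span = find_token_span(tokens, ["kdeycouk"])
print(span)
```

The returned pair can be dropped straight into the start and end arguments of Span.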

We can visualize our named entities within our text:

from spacy import displacy
displacy.render(doc, style='ent')

Tokenization

Tokenization is the process of splitting our body of text into ‘tokens’. In practice, that means turning a sentence into a list of words and punctuation marks. For example, ‘The dog barked at his owner’ would become [‘The’, ‘dog’, ‘barked’, ‘at’, ‘his’, ‘owner’].

To get started with tokenization, we need to create a document (a body of text to ingest).

doc = nlp('Microsoft word is not as good as google drive. 100%')

Now, if we run the loop below over the data, you’ll see a pretty cool output: a breakdown of each token in the sentence, along with whether it’s a noun, verb, etc. The library is also really very intelligent. If you have measurements (e.g. 50KM), spaCy will recognise that they should be grouped together, whereas a percentage such as 100% is split into 100 and a % sign.

for token in doc:
    print(token.text, token.pos, token.pos_) 
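
To get a feel for what a tokenizer does, here is a deliberately naive sketch using a regular expression. It is not spaCy's tokenizer, which uses language-specific rules and exceptions; simple_tokenize and its pattern are my own stand-ins. It happens to treat the two cases above the same way: letters and digits stick together, while symbols are split off.

```python
import re

# One or more word characters (letters, digits, underscore), or a single symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word-like chunks and standalone punctuation marks."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Microsoft word is not as good as google drive. 100%")
print(tokens)
print(simple_tokenize("50KM"))
```

A rule this simple breaks down quickly on things like abbreviations, URLs and contractions, which is exactly why a library tokenizer is worth using.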

Lemmatization

Okay, so the next thing we need to do as part of an NLP project is to group all the words that effectively mean the same thing. Think of a document that has Run, Ran and Running within it. We need to take each one back to its lemma, which is the base form found in the dictionary.

Below, you can see that eating and ate have both been lemmatized to ‘eat’.

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I enjoy eating burgers. I ate one yesterday and I am eating one now')
for token in doc:
    print(f'{token.text:{20}} {token.pos_:{20}} {token.lemma:{20}} {token.lemma_:{50}}')
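
The payoff of lemmatization is that different forms get counted as one word. The sketch below shows that idea with a toy lookup table. It is an assumption-heavy stand-in, not spaCy's lemmatizer: LEMMA_TABLE and toy_lemma are hypothetical and cover only a handful of words.

```python
from collections import Counter

# Hypothetical hand-written lookup table; a real lemmatizer covers far more words.
LEMMA_TABLE = {
    "eating": "eat",
    "ate": "eat",
    "ran": "run",
    "running": "run",
    "burgers": "burger",
}


def toy_lemma(word):
    """Return the base form from the table, or the lowercased word if unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


words = "I enjoy eating burgers I ate one yesterday and I am eating one now".split()
lemma_counts = Counter(toy_lemma(word) for word in words)
print(lemma_counts["eat"])
```

Without the lemma step, eating and ate would be counted as two unrelated words.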


While we are here, let’s talk about that text formatting. Here, we are printing the token text, part of speech, etc. and forcing the layout into a table-like view by including {20}, which sets the minimum width given to that item. I.e. the first three columns are each 20 characters wide and the last is 50.
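
Here is a tiny standalone example of that width syntax, so you can try it without loading a model. The values are made up for illustration (42 is not a real lemma ID).

```python
text, pos, lemma_id = "eating", "VERB", 42

# {10} inside the format spec is a nested field that evaluates to the width 10,
# so these two rows are identical.
row_nested = f"{text:{10}}{pos:{6}}|"
row_plain = f"{text:10}{pos:6}|"

# Strings are left-aligned by default, while integers are right-aligned.
id_cell = f"{lemma_id:{6}}|"

print(row_nested)
print(id_cell)
```

The right-alignment of integers is why the numeric lemma column sits flush against the right edge of its 20 characters, while the text columns start at the left.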

Stop Words

Words like ‘and’ and ‘the’ appear very often in a document and add little to its meaning. So, we should filter them out before conducting our analysis, so that their frequency does not skew our model. You can see the default stop words using the line of code below:

print(nlp.Defaults.stop_words)

And, you can add a stop word using:

nlp.Defaults.stop_words.add('wtf') 
nlp.vocab['wtf'].is_stop = True

You can remove a stop word using:

nlp.Defaults.stop_words.remove('wtf') #remove a stop word
nlp.vocab['wtf'].is_stop = False 

You can check if something is a stop word using:

nlp.vocab['wtf'].is_stop
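
Once you know which words are stop words, the filtering itself is a one-liner. The sketch below uses plain Python with a hypothetical mini stop list (STOP_WORDS and remove_stop_words are my own names) so it runs without a model. With a spaCy doc, the equivalent check would be on token.is_stop.

```python
# Hypothetical mini stop list for illustration; spaCy's default set is much larger.
STOP_WORDS = {"and", "the", "a", "of", "to"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in the stop list (case-insensitive)."""
    return [word for word in words if word.lower() not in stop_words]


words = "The dog and the cat ran to the park".split()
filtered = remove_stop_words(words)
print(filtered)
```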
Kodey