NLP Crash Course (Part 2): Approaches to text classification

In the last article, we spoke about how to prepare your data for your NLP model. In this article we are going to cover some of the models and techniques you might use when building a text classifier.

Bag of Words (BoW)

This is a very simple model. It involves building a vocabulary of words and recording the frequency of each of those words in the text.

Let’s do a little example. In the text below, each line will be treated as a separate document and the full four lines will be our corpus of documents.

‘I am a fly and I like to fly,

High in the sky,

Because I am a fly,

And I like to be high.’

First we need to establish our vocabulary. The words we have are:

  • I
  • Am
  • A
  • Fly 
  • And
  • Like
  • To
  • High
  • In
  • The
  • Sky
  • Because
  • Be

Now, we’re going to create our bag of words. We can do this with binary scoring (1 if the word appears in the document, 0 if not) as below, or with a count of the number of times each word appears in the document.

Word      Document 1  Document 2  Document 3  Document 4
I             1           0           1           1
Am            1           0           1           0
A             1           0           1           0
Fly           1           0           1           0
And           1           0           0           1
Like          1           0           0           1
To            1           0           0           1
High          0           1           0           1
In            0           1           0           0
The           0           1           0           0
Sky           0           1           0           0
Because       0           0           1           0
Be            0           0           0           1
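A table like this can be built with a few lines of plain Python. This is a minimal sketch: the tokenisation (lowercasing and splitting on whitespace) is deliberately simplified, and in practice a library such as scikit-learn would do this for you.

```python
# Build a binary bag-of-words table for the four-line corpus.
docs = [
    "I am a fly and I like to fly",
    "High in the sky",
    "Because I am a fly",
    "And I like to be high",
]

# Simplified tokenisation: lowercase and split on whitespace.
tokenised = [doc.lower().split() for doc in docs]

# The vocabulary is the set of unique words across the whole corpus.
vocab = sorted({word for doc in tokenised for word in doc})

# Binary scoring: 1 if the word appears in the document, 0 otherwise.
bow = [[1 if word in doc else 0 for word in vocab] for doc in tokenised]

print(vocab)   # the 13 unique words
print(bow[1])  # document 2 only contains 'high', 'in', 'the' and 'sky'
```

Swapping `1 if word in doc else 0` for `doc.count(word)` gives the count-based variant instead of binary scoring.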


Okay, so now we know how many times each word appears in each document. But that isn’t really all that useful on its own. I wouldn’t mind betting that the most frequent words in your vectorized analysis are ‘the’, ‘is’ or other stop words which don’t add any meaning to the document.

Term Frequency-Inverse Document Frequency (TFIDF)

TFIDF does two things:

  1. It looks at the frequency of each word across the entire corpus (all the documents) rather than in one document in isolation.
  2. It inverts the weighting assigned to each word. For example, if ‘the’ is the most commonly occurring word, it will have the least importance to the model; while the word ‘vectorization’ may occur only once across all the articles, making it highly distinctive and hence likely an important term in its document.

In other words, we take the relative frequency of words across all articles and use it to determine how important each word is to the topic of an article.

TFIDF specifically:

  1. Calculates the term frequency (TF) – simply, how often the term occurs within a document
  2. Weighs down (makes less important) words used frequently across the corpus (like ‘is’ or ‘the’)
  3. Weighs up (makes more important) the unique or less used terms in the document

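To make the weighting concrete, here is a minimal sketch in plain Python using one common formulation, tf × log(N/df), applied to the fly corpus from earlier. Note that real libraries such as scikit-learn use slight variants of this formula.

```python
import math

docs = [
    "I am a fly and I like to fly",
    "High in the sky",
    "Because I am a fly",
    "And I like to be high",
]
tokenised = [doc.lower().split() for doc in docs]
n_docs = len(tokenised)

def tf(word, doc):
    # Term frequency: how often the word occurs, relative to document length.
    return doc.count(word) / len(doc)

def idf(word):
    # Inverse document frequency: words in fewer documents score higher.
    df = sum(1 for doc in tokenised if word in doc)
    return math.log(n_docs / df)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

# 'i' appears in three of the four documents, 'sky' in only one,
# so 'sky' gets a much higher weight.
print(idf("i"), idf("sky"))
print(tfidf("fly", tokenised[0]))
```

This shows the inversion described above: the common word ‘i’ is weighed down, while the rare word ‘sky’ is weighed up.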
Both of these approaches have a few limitations. Firstly, with a very large vocabulary your feature vectors become very large, and sparsity is a problem too. Most of all, discarding word order and context in a document loses much of the document’s meaning and hence makes for a weaker model.

Now consider a situation where we have 2 sentences:

  • I don’t like cheese
  • I like cheese very much

If we tokenize these sentences word by word, they both contain the words ‘I’, ‘like’ and ‘cheese’. However, one is a very positive statement and the other a very negative one. The issue with the bag of words model is that it loses this context: it simply looks at word frequency and would likely produce a poor model. N-grams go some way to mitigating this issue.


N-Grams

An n-gram is a sequence of words ‘n’ words long. It’s just a way of daisy-chaining neighbouring words in a document together. Let’s go back to our two sentences about cheese.

We can split these sentences into unigrams (one word at a time); bigrams (two words at a time); trigrams (three words at a time) and n-grams (whatever you define).

         Sentence 1                               Sentence 2
Unigram  ‘I’, ‘don’t’, ‘like’, ‘cheese’           ‘I’, ‘like’, ‘cheese’, ‘very’, ‘much’
Bigram   ‘I don’t’, ‘don’t like’, ‘like cheese’   ‘I like’, ‘like cheese’, ‘cheese very’, ‘very much’
Trigram  ‘I don’t like’, ‘don’t like cheese’      ‘I like cheese’, ‘like cheese very’, ‘cheese very much’

You can see that in the trigrams above, we now have ‘don’t like cheese’ in the first statement and ‘I like cheese’ in the second. So suddenly, we have some context for what is being said.
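Generating n-grams is straightforward. Here is a minimal sketch in plain Python, skipping the casing and punctuation handling a real tokeniser would apply:

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence_1 = "I don't like cheese".split()
sentence_2 = "I like cheese very much".split()

print(ngrams(sentence_1, 2))  # "I don't", "don't like", "like cheese"
print(ngrams(sentence_1, 3))  # "I don't like", "don't like cheese"
print(ngrams(sentence_2, 3))  # includes "I like cheese"
```

Setting `n=1` recovers the plain unigram tokens, so the same function covers every row of the table above.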

I’ve done a short practical implementation of an SVM using these concepts here.
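As a rough illustration of how these pieces fit together (this is a sketch, not the linked implementation), scikit-learn lets you chain a TF-IDF vectorizer with unigram and bigram features into a linear SVM. The tiny training set and labels below are invented purely for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy training data: sentiment labels for illustration only.
texts = [
    "I don't like cheese",
    "cheese is awful",
    "I like cheese very much",
    "cheese is wonderful",
]
labels = ["negative", "negative", "positive", "positive"]

# TF-IDF over unigrams and bigrams, feeding a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

print(model.predict(["I really like cheese"]))
```

The `ngram_range=(1, 2)` setting is what lets the model see ‘don’t like’ as a single feature, rather than losing the negation the way a pure unigram bag of words would.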
