NLP Crash Course (Part 1): Preparing data for NLP

Natural Language Processing (NLP) is one of the major topics in data science. We use it to identify the major themes or topics within a body of text, to understand the sentiment of text, or to categorize text. In this article, we will go through the core concepts of NLP.

At a high level, NLP projects follow a workflow similar to a traditional machine learning project: data cleaning (text processing), feature extraction, and modeling. In this article, we will discuss data preparation and cleaning.

Within this phase, we need to do a number of things to extract only the important concepts from a body of text and to understand context. We also use some of the techniques below to remove noise, so that a more consistent dataset is passed into our models.

Handle Tags

Firstly, if you’ve scraped the data from the web, there are likely to be loads of HTML tags that you need to get rid of. Step 1 in the cleaning process is to drop these from the text.
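
One possible way to drop the tags is sketched below using only Python's standard library. The class name TagStripper, the function strip_tags and the sample HTML are illustrative assumptions, not part of any particular scraping setup.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper that keeps only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text content; the tags themselves never reach this method.
        self.parts.append(data)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    # Join the text pieces and collapse any leftover runs of whitespace.
    return " ".join("".join(stripper.parts).split())


raw = "<div><p>My dog's name is <b>Polly</b>.</p></div>"
clean = strip_tags(raw)
print(clean)
```

Note that this simple version also keeps the contents of script and style elements, so real scraped pages may need those removed separately.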

Case Normalization

Next, we ensure that our words are all in the same case. We call this normalization: converting the whole body of text to either upper or lower case.
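
As a small sketch of why this matters, the snippet below counts words before and after lowercasing. The sample sentence is made up for illustration.

```python
from collections import Counter

words = "The dog saw the Dog".split()

# Without normalization, 'dog' and 'Dog' are counted as different words.
raw_counts = Counter(words)

# After lowercasing, both spellings collapse into a single entry.
normalized_counts = Counter(w.lower() for w in words)

print(raw_counts["dog"], normalized_counts["dog"])
```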

Tokenization

Now we split the string and tokenize the words. When we say tokenize, we mean splitting the text into individual tokens (here, words); each distinct token can then be mapped to a numeric ID. So the string “My dog’s name is Polly” will become [1, 2, 3, 4, 5]. If we then have a sentence which says “My dog’s name is Fido”, it will become [1, 2, 3, 4, 6]. The shared IDs let us assess how similar the sentences are without losing the differences between them.
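
The Polly and Fido example can be sketched in plain Python as follows. The function names tokenize and encode are hypothetical, and splitting on whitespace is a simplifying assumption; library tokenizers also handle punctuation and other edge cases.

```python
def tokenize(sentence):
    # Lowercase and split on whitespace (a deliberately simple tokenizer).
    return sentence.lower().split()


def encode(tokens, vocab):
    # Give each previously unseen token the next free integer ID.
    ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
        ids.append(vocab[token])
    return ids


vocab = {}
polly_ids = encode(tokenize("My dog's name is Polly"), vocab)
fido_ids = encode(tokenize("My dog's name is Fido"), vocab)

print(polly_ids)  # [1, 2, 3, 4, 5]
print(fido_ids)   # [1, 2, 3, 4, 6]
```

Because both sentences share one vocabulary, the first four IDs match and only the last one differs.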

Remove Stopwords

Now we need to remove all the junky words in the text: things that don’t really carry meaning on their own. We call these stop-words. They are things like ‘the’, ‘then’ and ‘and’. They make sentences more readable to humans but just act as clutter for the machine and make it harder to identify the topic or meaning of a sentence.
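
A minimal sketch of stop-word removal is shown below. The STOPWORDS set is a tiny hand-picked list used purely for illustration; NLP libraries ship much longer lists.

```python
# Tiny illustrative stop-word list (an assumption, not a library's list).
STOPWORDS = {"the", "then", "and", "is", "a", "my"}


def remove_stopwords(tokens):
    # Keep only the tokens that are not in the stop-word set.
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = "the dog barked and then the cat ran".split()
content_words = remove_stopwords(tokens)

print(content_words)  # ['dog', 'barked', 'cat', 'ran']
```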

Parts of Speech (POS) Tagging

Next, we’re going to use Parts of Speech (POS) tagging. This is where we identify whether a word is a verb, noun, etc. Why is this helpful? By knowing the POS in addition to the sequence of the words, the algorithm can better understand the context around each word. In English, some words can be both a noun and a verb depending on their context, for example fly (the insect) versus fly (the verb). When we come to create our bag of words, we will then have two entries for the word ‘fly’, one per part of speech, which makes the model better because it reduces ambiguity in the text. Below are the parts of speech identified by spaCy (an NLP library).

  • ADJ: adjective, e.g. big, old, green, incomprehensible, first
  • ADP: adposition, e.g. in, to, during
  • ADV: adverb, e.g. very, tomorrow, down, where, there
  • AUX: auxiliary, e.g. is, has (done), will (do), should (do)
  • CONJ: conjunction, e.g. and, or, but
  • CCONJ: coordinating conjunction, e.g. and, or, but
  • DET: determiner, e.g. a, an, the
  • INTJ: interjection, e.g. psst, ouch, bravo, hello
  • NOUN: noun, e.g. girl, cat, tree, air, beauty
  • NUM: numeral, e.g. 1, 2017, one, seventy-seven, IV, MMXIV
  • PART: particle, e.g. ’s, not
  • PRON: pronoun, e.g. I, you, he, she, myself, themselves, somebody
  • PROPN: proper noun, e.g. Mary, John, London, NATO, HBO
  • PUNCT: punctuation, e.g. ., (, ), ?
  • SCONJ: subordinating conjunction, e.g. if, while, that
  • SYM: symbol, e.g. $, %, §, ©, +, −, ×, ÷, =, :)
  • VERB: verb, e.g. run, runs, running, eat, ate, eating
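
To illustrate how POS tags separate the two senses of ‘fly’ in a bag of words, here is a small sketch. The tags are assigned by hand for the sentence “a fly can fly” as an assumption; in practice they would come from a tagger such as spaCy's.

```python
from collections import Counter

# Hand-labelled (token, POS) pairs for "a fly can fly" (assumed tags).
tagged = [("a", "DET"), ("fly", "NOUN"), ("can", "AUX"), ("fly", "VERB")]

# A plain bag of words merges both uses of 'fly' into one count.
plain_bag = Counter(token for token, _pos in tagged)

# Keying on (token, POS) keeps the noun and the verb apart.
pos_bag = Counter(tagged)

print(plain_bag["fly"])           # 2
print(pos_bag[("fly", "NOUN")])   # 1
print(pos_bag[("fly", "VERB")])   # 1
```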

Named Entity Recognition (NER)

Named entity recognition helps sort unstructured data and detect important information. For example, we can automatically identify people, places, brands, dates (e.g. 20th December), timeframes (e.g. 10 months), percentages (e.g. 20%) and monetary values. The entity type can form a new feature for our model, which can further improve accuracy by giving it a better understanding of the context.

Predict Syntactic Dependencies

This is how we identify how the words in a sentence are related, for example whether a word is the subject of the sentence. If we had ‘Barbara ate sausages’, then ‘Barbara’ would be the subject of the sentence and ‘sausages’ would be the object. spaCy’s documentation lists the full set of dependency tags. The key tags here are ‘nsubj’, the nominal subject (Barbara), and ‘dobj’, the direct object (sausages).

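Here is the output from spaCy for that sentence. Each line shows the token, its part of speech, its dependency tag and the head word it attaches to:

Barbara PROPN nsubj ate
ate VERB ROOT ate
sausages NOUN dobj ate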
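
Once the dependency tags are available, they can be used to pull out who did what to whom. The sketch below works on the three rows of output copied by hand from above rather than calling spaCy, and the helper find_by_dep is a hypothetical name.

```python
# Rows copied from the output above: (token, POS, dependency tag, head).
parsed = [
    ("Barbara", "PROPN", "nsubj", "ate"),
    ("ate", "VERB", "ROOT", "ate"),
    ("sausages", "NOUN", "dobj", "ate"),
]


def find_by_dep(rows, dep):
    # Return the first token carrying the given dependency tag, or None.
    for text, _pos, tag, _head in rows:
        if tag == dep:
            return text
    return None


# Subject, main verb and direct object of the sentence.
triple = (
    find_by_dep(parsed, "nsubj"),
    find_by_dep(parsed, "ROOT"),
    find_by_dep(parsed, "dobj"),
)

print(triple)  # ('Barbara', 'ate', 'sausages')
```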

Handle Synonyms

The next phase is to handle synonyms where required. For example, if you have ‘store’, ‘shop’, ‘supermarket’ and ‘boutique’, they all fundamentally mean the same thing, and by standardizing on one word you can improve the model’s understanding of the body of text.
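
A simple way to standardize synonyms is a lookup table that maps each variant to one canonical word, as sketched below. The SYNONYMS mapping is hand-written for this example; which words count as synonyms depends on your own use case.

```python
# Map each variant to the canonical word 'store' (illustrative mapping).
SYNONYMS = {"shop": "store", "supermarket": "store", "boutique": "store"}


def standardize(tokens):
    # Replace known variants; leave every other token unchanged.
    return [SYNONYMS.get(t, t) for t in tokens]


tokens = "the boutique and the supermarket".split()
standardized = standardize(tokens)

print(standardized)  # ['the', 'store', 'and', 'the', 'store']
```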

Stemming & Lemmatization

Stemming is about cutting a word back to its core part. For example, ‘eating’, ‘eaten’ and ‘eats’ are all derived from the word ‘eat’, so we can stem each string back to this core component. This is not always very successful. Think about ‘studies’ and ‘studying’: the stems would be ‘studi’ and ‘study’ respectively, which is not quite what we are looking for.

Lemmatization relies on a much more detailed dictionary. It looks the word up and converts it to its ‘lemma’, so in the example above both words would be output simply as ‘study’.
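
The difference can be sketched with two toy functions. Here naive_stem is a deliberately crude suffix stripper, not a real stemming algorithm such as Porter's, and LEMMAS is a tiny hand-written dictionary standing in for the much larger one a real lemmatizer uses; both are assumptions for illustration.

```python
def naive_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny stand-in for a lemmatizer's dictionary (illustrative entries only).
LEMMAS = {"studies": "study", "studying": "study", "eaten": "eat"}


def lemmatize(word):
    # Look the word up; fall back to the word itself if it is unknown.
    return LEMMAS.get(word, word)


words = ["studies", "studying"]
stems = [naive_stem(w) for w in words]
lemmas = [lemmatize(w) for w in words]

print(stems)   # ['studi', 'study']
print(lemmas)  # ['study', 'study']
```

The stemmer produces two different strings for the same underlying word, while the dictionary lookup maps both to ‘study’.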

In the next article, we will start looking at implementation concepts.


Below is a rough example showing how easy it is to manage many of these concepts using libraries such as spaCy. The processes you include and the order you apply them in will very much depend on your own use case.

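import spacy

# Load spaCy's small English pipeline (tagger, parser, lemmatizer, NER)
nlp = spacy.load('en_core_web_sm')

def cleanup(row):
    # Case normalization
    review_txt = row['Review_Text'].lower()

    # Remove contractions: keep only the part before the apostrophe
    p = []
    for word in review_txt.split():
        out = word.split("'")[0]
        p.append(out)

    # Loop through tokens, remove stopwords
    cleaned = []
    stopwords = nlp.Defaults.stop_words
    for x in p:
        if x.lower() not in stopwords:
            cleaned.append(x)

    # Remove unwanted POS: keep only tokens with the tags listed below
    review_v2 = ' '.join([str(x) for x in cleaned])
    review_v2 = nlp(review_v2)
    pos = []
    for y in review_v2:
        if y.pos_ in ['ADJ', 'DET', 'NOUN', 'PROPN']:
            pos.append(y)

    # Lemmatize the remaining tokens
    review_v3 = ' '.join([str(x) for x in pos])
    review_v3 = nlp(review_v3)
    lem = []
    for y in review_v3:
        lem.append(y.lemma_)
    return lem

# df is assumed to be an existing pandas DataFrame with a 'Review_Text' column
df['prepared'] = df.apply(cleanup, axis=1)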