NLP Series: Is It Spam? Text Classification

Text classification is all about assigning a label to a body of text, without anyone actually having to read it. It’s a supervised machine learning task: a model learns from examples that have already been labelled, using features such as term frequency (how often words occur), and then classifies new documents.

The classic example is determining whether an email is spam or not. To do this, we can ingest a dataframe which includes message (the text of the email) and label (whether it was spam or not).

Now, we split our data into training and test sets (here, 30% of the rows are held back for testing) and we will be ready to rock and roll.

from sklearn.model_selection import train_test_split

x = df['message']
y = df['label']

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=42)

Vectorization

The first step in any text classification task (after the ingestion of data & the test train split) is to vectorize the text. What’s vectorization? Let’s look at an example. Let’s say I have the text: ‘Monkey see monkey do’ and another block of text: ‘Monkey eats bananas’.

In the below, you can see an example of vectorization. It’s simply a count of the occurrence of each word in an article / body of text.

DOCUMENT  MONKEY  SEE  DO  EATS  BANANAS
DOC1      2       1    1   0     0
DOC2      1       0    0   1     1

In the below, we’re taking our x_train data and vectorizing it to create a variable called x_train_counts. The result of this (in my dataset) is a sparse matrix (3700 rows by 7000 columns).

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)

Let’s tackle each part of what I just said:

  • A sparse matrix is one that is primarily unpopulated (mostly zeros). In the above example, we had just 5 words. Imagine a long body of text: you’ll have thousands of words across the corpus and only a few will appear in each article. Hence, it will be sparsely populated.
  • The 3700 rows by 7000 columns is derived from 3700 individual text documents (rows of my Pandas dataframe) being vectorized. Across all the documents, they contain 7,000 unique words. Hence, like above, we have a very wide matrix with a column for each possible word.

The interesting thing is that if you ran x_train.shape, you would get a result of (3700,) as it’s a 1D array. Whereas x_train_counts.shape would give you a result of (3700, 7000). So you can see clearly the transformation which occurred during vectorization.
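
To see the same shape change on something small, here is a sketch with four made-up messages standing in for x_train (the messages and the variable names x_train_demo and demo_counts are my own placeholders, not the dataset used in this post). It also shows how few cells are actually populated, even on a tiny example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for x_train (not the real dataset)
x_train_demo = pd.Series([
    "win a free prize now",
    "are we still meeting for lunch",
    "free entry to win cash",
    "see you at lunch",
])

demo_counts = CountVectorizer().fit_transform(x_train_demo)

print(x_train_demo.shape)  # (4,)    -> 1D: one entry per message
print(demo_counts.shape)   # (4, 16) -> 2D: one row per message, one column per unique word

# Share of cells that hold a non-zero count
n_rows, n_cols = demo_counts.shape
density = demo_counts.nnz / (n_rows * n_cols)
print(f"{density:.0%} of cells are non-zero")
```

On a real corpus with thousands of unique words, that percentage drops much lower, which is why the result is stored as a sparse matrix.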

TF-IDF

Okay, so now we know how many times each word appears in each document. But that isn’t really all that useful on its own. I wouldn’t mind betting that the ten most frequent words in your vectorized data are ‘the’, ‘is’ or other stop words which don’t add any meaning to the document.

Term Frequency-Inverse Document Frequency (TF-IDF) does two things:

  1. It looks at how many documents across the entire corpus (all the documents) contain each word, rather than looking at one document in isolation.
  2. It inverts that document frequency to weight each word. For example, if ‘the’ appears in almost every document, it will have the least importance to the model; while the word ‘vectorization’ may appear in only a handful of documents, so when it does show up it is likely to be an important term in that document.

In other words, we compare how often a word occurs in one article with how common it is across all articles, to determine how important it is to the topic of that article.

TF-IDF specifically:

  1. Calculates the term frequency (TF), which is simply how often that term occurs in a document
  2. Weighs down (makes less important) words which appear in most documents (like ‘is’ or ‘the’)
  3. Weighs up (makes more important) the terms which appear in only a few documents

We can apply TF-IDF using the below code.

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
x_train_tf = transformer.fit_transform(x_train_counts)
x_train_tf.shape

That code, though, is more convoluted than it needs to be. We used the vectorizer and the TF-IDF transformer as two separate blocks of code. We don’t need to do this; there is a shorthand method. Here we do fit_transform directly on x_train, rather than on the vectorized input of x_train_counts. With the default settings, the output is the same.

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
x_train_vect = vect.fit_transform(x_train)
x_train_vect.shape
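
If you want to convince yourself that the two routes agree, the sketch below runs both on a few made-up messages (placeholders of mine, not the real dataset) and compares the results. It assumes default parameters throughout; if you change settings on one route, you would need to change them on the other too.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical messages for illustration only
docs = [
    "win a free prize now",
    "are we still meeting for lunch",
    "free entry to win cash",
]

# Route 1: count the words, then apply TF-IDF as a separate step
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# Route 2: do both in one go
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)  # True
```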

Training an SVM model  

Right, so now we have our data prepared. Let’s train our model. In the below, we’re importing the required libraries and fitting the model to our training data.

from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(x_train_vect, y_train)

To make things a bit neater and more manageable we can deploy it directly in a pipeline. In the below, I have named two steps: one is a vectorizer step which calls the TFIDF vectorizer and the second is the linear support vector machine model.

from sklearn.pipeline import Pipeline
text_classifier = Pipeline([('Vectorizer Step', TfidfVectorizer()), ('Classifier', LinearSVC())])

We can use this pipeline simply by fitting it with our data, just like we did above.

text_classifier.fit(x_train, y_train)
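
The step names are not just decoration: once the pipeline is fitted, you can use them to pull out each stage and inspect it. Below is a small sketch using four made-up messages with ‘spam’ / ‘ham’ labels (both the messages and the label values are assumptions of mine for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data for illustration only
messages = [
    "win a free prize now",
    "are we still meeting for lunch",
    "free entry to win cash",
    "see you at lunch",
]
labels = ["spam", "ham", "spam", "ham"]

text_classifier = Pipeline([('Vectorizer Step', TfidfVectorizer()), ('Classifier', LinearSVC())])
text_classifier.fit(messages, labels)

# Look up each fitted step by the name we gave it
vectorizer = text_classifier.named_steps['Vectorizer Step']
classifier = text_classifier.named_steps['Classifier']

# The vectorizer learned one column per unique word...
n_words = len(vectorizer.vocabulary_)
print(n_words)
# ...and the linear SVM learned one weight per column
print(classifier.coef_.shape)
```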

Making predictions

Now, we can make some predictions. We simply pass our x_test data into the pipeline. The pipeline will then run the vectorization and the classification model to produce the predictions.

predictions = text_classifier.predict(x_test)

We can then check performance in the usual ways: with a confusion matrix, a classification report and an accuracy score.

from sklearn.metrics import confusion_matrix, classification_report
from sklearn import metrics
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test,predictions))
metrics.accuracy_score(y_test, predictions)
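
If the confusion matrix is new to you, here is a tiny sketch with five hand-written labels (made up for illustration, and assuming the labels are the strings ‘ham’ and ‘spam’). Rows are the true labels and columns are the predicted labels, in the order given by the labels argument.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true = ["ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "spam", "ham"]

cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)
# [[2 1]    <- true ham:  2 predicted ham, 1 wrongly predicted spam
#  [0 2]]   <- true spam: 0 wrongly predicted ham, 2 predicted spam

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.8 (4 of the 5 predictions are correct)
```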

Passing new values into the model

We can now classify brand new messages. The below is predicted NOT to be spam.

text_classifier.predict(['this is a message I want to test'])

While this is predicted to be spam. 

text_classifier.predict(['you have won 1 million pounds and free entry to a contest'])
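
A label on its own doesn’t tell you how close the call was. One option (a sketch, not something the steps above depend on) is to ask the pipeline for the SVM’s decision score via decision_function. For a two-class model, a positive score means the second class in classes_ was predicted and a negative score means the first; the further from zero, the more confident the model. The training messages and the ‘ham’ / ‘spam’ labels below are placeholders of mine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data for illustration only
messages = [
    "win a free prize now",
    "are we still meeting for lunch",
    "free entry to win cash",
    "see you at lunch",
]
labels = ["spam", "ham", "spam", "ham"]

text_classifier = Pipeline([('Vectorizer Step', TfidfVectorizer()), ('Classifier', LinearSVC())])
text_classifier.fit(messages, labels)

new_messages = [
    "this is a message I want to test",
    "you have won 1 million pounds and free entry to a contest",
]

# Signed distance from the decision boundary, one score per message
scores = text_classifier.decision_function(new_messages)
preds = text_classifier.predict(new_messages)

print(text_classifier.classes_)  # classes in sorted order; a positive score maps to the second one
for message, score, pred in zip(new_messages, scores, preds):
    print(f"{pred:>5}  {score:+.2f}  {message}")
```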

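An end-to-end project

The same approach works for other labelled text. Here is the whole thing in one script, this time applied to a file of reviews rather than emails.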

import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

#ingest our review data
df = pd.read_csv('/home/reviews.tsv', sep='\t')

#handle nulls
df.dropna(inplace=True)

#split out X (feature) and Y (target)
x = df['review']
y = df['label']

#split our test and train datasets
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.3, random_state = 42)

#Create a pipeline to vectorize and then classify our data
text_classifier = Pipeline([('vectorize', TfidfVectorizer()), ('Classifier', LinearSVC())])

#use the new pipeline and fit our training data to it
text_classifier.fit(x_train, y_train)

#Make some predictions based on the test data
predictions = text_classifier.predict(x_test)

#Validate accuracy
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test,predictions))
#print the score so it also shows up when run as a script, not just in a notebook
print(metrics.accuracy_score(y_test, predictions))

#Manually check with a review
print(text_classifier.predict(['Terrible movie']))
Kodey
