Support vector machines are supervised classification models in which each observation (or data point) is plotted in an N-dimensional space, where N is the number of features.
If our features were height and weight, we would have a 2-dimensional graph. And if we were trying to classify those people as average or overweight, we would draw a line like below, where points above the line are overweight and points below are average or OK (note: this is mock data and does not represent real over/under weight values).
But where do we draw the decision boundary? Well, think of the boundary line like a road. We have loads of traffic down the road, so we want it to be as wide as possible. We could draw the road like below, between the two groups. The group in blue would be ‘Overweight’ and the group in orange would be ‘OK’. Of course, if we have multiple categories, we will have multiple decision boundaries.
Or, we could draw it like the below. This road is much wider and hence is the better split: wider margins help to prevent overfitting. To clarify, we are looking for the widest margin, the one where the distance between the central line and the points closest to it is as large as possible. We call this central line the hyperplane (with two features it is a line; with more features it becomes a plane or a higher-dimensional surface).
Does it even matter? Yes! Unoptimised decision boundaries could result in misclassification of new data, which would be bad!
So what exactly is a support vector? Well, a support vector is a datapoint (also known as a vector) that the margin pushes up against. In other words, they’re the points closest to the road.
The support vector machines algorithm is pretty memory efficient: the decision boundary depends only on the support vectors, so the trained model uses a subset of the overall dataset.
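To make the idea of support vectors a bit more concrete, here is a minimal sketch using scikit-learn's SVC with a linear kernel. The height and weight values are invented purely for illustration (like the mock data above), and the variable names are my own.

```python
import numpy as np
from sklearn.svm import SVC

# Mock (height in cm, weight in kg) data, invented for illustration only
X = np.array([
    [160, 50], [165, 52], [170, 62], [180, 68],    # 'OK'
    [160, 78], [165, 88], [170, 85], [180, 100],   # 'Overweight'
])
y = np.array(['OK'] * 4 + ['Overweight'] * 4)

# A linear kernel gives a straight-line decision boundary
clf = SVC(kernel='linear')
clf.fit(X, y)

# Only the points that the margin pushes up against are kept as support vectors
print(clf.support_vectors_)
print(len(clf.support_vectors_), 'of', len(X), 'points are support vectors')

# Classify a new, unseen person
print(clf.predict([[168, 58]]))
```

The fitted model only needs the rows in support_vectors_ to place the boundary, which is where the memory efficiency comes from.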
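Sometimes the groups can't be separated by a straight line. In that case we can transform the data into a higher dimension. For example, if we have a single dimension of data, we can run a function (such as squaring each value) to turn it into a 2-dimensional parabola, where a straight line may be able to separate the groups.
- Effective in high-dimensional spaces
- Uses a subset of datapoints (the support vectors) in the decision function, so it's memory efficient
- On the downside, if the number of features is greater than the number of samples, performance can be poor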
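As a rough sketch of that transformation (with made-up numbers), the snippet below takes 1D data where one class sits in the middle of the other, maps each value x to the pair (x, x squared), and then fits a linear SVM in the new 2D space. The squaring function is just one assumed choice of transformation for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Mock 1D data: class 1 sits in the middle, class 0 on both sides,
# so no single cut-off on x can separate the two classes
x = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

# Map each point x to (x, x**2): the points now lie on a parabola
X_2d = np.column_stack([x, x ** 2])

# In 2D, a straight line can now split the classes
clf = SVC(kernel='linear')
clf.fit(X_2d, y)

print(clf.predict(X_2d))
print(y)
```

In practice you would rarely build the extra column by hand; SVC's non-linear kernels (for example kernel='poly' or kernel='rbf') achieve a similar effect implicitly.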
Text classification is all about classifying a body of text without actually reading it. It’s a supervised machine learning task; the approach here uses term frequency (how often words occur) to classify the document.
The classic example is to determine whether an email is spam or not. To do this, we can ingest a dataframe which includes message (what the text of the email was) and label (whether it was spam or not).
Now, we split our data out into test and training sets and we will be ready to rock and roll.
from sklearn.model_selection import train_test_split

x = df['message']
y = df['label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
The first step in any text classification task (after the ingestion of data & the test train split) is to vectorize the text. What’s vectorization? Let’s look at an example. Let’s say I have the text: ‘Monkey see monkey do’ and another block of text: ‘Monkey eats bananas’.
In the below, you can see an example of vectorization. It’s simply a count of the occurrence of each word in an article / body of text.
In the below, we’re taking our x_train data and vectorizing it to create a variable called x_train_counts. The result of this (in my dataset) is a sparse matrix (3700 rows by 7000 columns).
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)
Let’s tackle each part of what I just said:
- A sparse matrix is one that is primarily not populated (most of its entries are zero). In the above example, we had just 5 words. Imagine, across a long body of text, you’ll have thousands of words and only a few will appear in each article. Hence, it will be sparsely populated.
- The 3700 rows by 7000 columns is derived from 3700 individual text documents (rows of my Pandas dataframe) being vectorised. Across all the documents, they have 7,000 unique words. Hence, like above, we have a very wide matrix with each possible word included in it.
The interesting thing is, if you did x_train.shape, you would get a result of (3700,) as it’s a 1D array. Whereas x_train_counts.shape would give you a result of (3700, 7000). So you can see clearly the transformation which occurred during vectorization.
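To see the shape change and the sparsity on something small enough to inspect, here is a sketch with three made-up messages standing in for x_train. The messages and variable names are hypothetical, not taken from my dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Three invented messages standing in for the real training data
messages = pd.Series([
    'free entry to win a prize',
    'are we still meeting for lunch',
    'win cash now',
])

counts = CountVectorizer().fit_transform(messages)

print(messages.shape)  # 1D: one entry per message
print(counts.shape)    # 2D: one row per message, one column per unique word

# nnz is the number of non-zero cells stored in the sparse matrix
total_cells = counts.shape[0] * counts.shape[1]
print(counts.nnz, 'of', total_cells, 'cells are populated')
```

Even with three short messages, most cells are zero; with thousands of documents the populated share becomes far smaller.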
Okay, so now we know how many times each word appears in the document. But that isn’t really all that useful. I wouldn’t mind betting that the top 10 or more words in your vectorized analysis are ‘the’, ‘sometimes’ or other stop words, which don’t add any meaning to the document.
Term Frequency-Inverse Document Frequency (TFIDF) does two things:
- It looks at how widely each word is used across the entire corpus (all the documents) rather than in one document in isolation.
- It inverts the weighting assigned to each word. For example, if ‘the’ appears in almost every document, it will have the least importance to the model; while the word ‘vectorization’ may only occur in one of the articles, so it’s distinctive and likely to be an important term in that document.
In other words, we take the relative frequency of words across all articles and determine how important they are to the topic of the article.
- Calculates the term frequency (TF) – which is simply, how often that term occurs
- Weighs down (makes less important) frequently used words (like ‘is’ or ‘the’)
- Weighs up (makes them more important) the unique or less used terms in the document
We can apply TF-IDF using the below code.
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
x_train_tf = transformer.fit_transform(x_train_counts)
x_train_tf.shape
That code, though, is more convoluted than it needs to be. We used the vectorizer and the TF-IDF functions as two separate blocks of code. We don’t need to do this; there is a shorthand method. Here we do fit_transform directly on x_train, rather than on the vectorized input of x_train_counts. The output is the same.
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
x_train_vect = vect.fit_transform(x_train)
x_train_vect.shape
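If you want to convince yourself that the two routes give the same result, a quick check like the one below should do it. The three messages are invented stand-ins for x_train, and both routes use scikit-learn's default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Invented messages standing in for the training data
docs = [
    'free entry to win a prize',
    'are we still meeting for lunch',
    'win cash now',
]

# Route 1: count the words, then apply TF-IDF as a second step
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: do both in one go
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two matrices cell by cell
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```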
Training an SVM model
Right, so now we have our data prepared. Let’s train our model. In the below, we’re importing the required libraries and fitting the model to our training data.
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(x_train_vect, y_train)
To make things a bit neater and more manageable we can deploy it directly in a pipeline. In the below, I have named two steps: one is a vectorizer step which calls the TFIDF vectorizer and the second is the linear support vector machine model.
from sklearn.pipeline import Pipeline

text_classifier = Pipeline([('Vectorizer Step', TfidfVectorizer()),
                            ('Classifier', LinearSVC())])
We can use this pipeline simply by fitting it with our data, just like we did above.
text_classifier.fit(x_train, y_train)
Now, we can make some predictions. We simply pass our x_test data into the pipeline. The pipeline will then run the vectorization and the classification model to produce the predictions.
predictions = text_classifier.predict(x_test)
We can then check accuracy in the usual ways, with a confusion matrix, classification report and accuracy score.
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import metrics

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
metrics.accuracy_score(y_test, predictions)
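If the confusion matrix output is hard to read, it can help to try it on a handful of hand-written labels first. In the sketch below, the ‘ham’ and ‘spam’ label names and the six example predictions are assumptions made up for illustration, not output from the model above.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hand-written true labels and predictions, invented for illustration
y_true = ['ham', 'ham', 'ham', 'spam', 'spam', 'ham']
y_pred = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham']

# Rows are the true labels, columns are the predicted labels,
# both in the order given by the labels argument
cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])
print(cm)
# [[3 1]
#  [1 1]]

# Accuracy is the share of predictions that match: 4 of the 6 here
acc = accuracy_score(y_true, y_pred)
print(acc)
```

The top-right cell counts genuine messages that were flagged as spam, and the bottom-left cell counts spam that slipped through.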
Passing new values into the model
We can now make some predictions on new text. The message below is predicted NOT to be spam.
text_classifier.predict(['this is a message I want to test'])
While this is predicted to be spam.
text_classifier.predict(['you have won 1 million pounds and free entry to a contest'])
An end to end project:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Ingest our review data
df = pd.read_csv('/home/reviews.tsv', sep='\t')

# Handle nulls
df.dropna(inplace=True)

# Split out X (feature) and Y (target)
x = df['review']
y = df['label']

# Split our test and train datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Create a pipeline to vectorize and then classify our data
text_classifier = Pipeline([('vectorize', TfidfVectorizer()),
                            ('Classifier', LinearSVC())])

# Use the new pipeline and fit our training data to it
text_classifier.fit(x_train, y_train)

# Make some predictions based on the test data
predictions = text_classifier.predict(x_test)

# Validate accuracy
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(metrics.accuracy_score(y_test, predictions))

# Manually check with a review (printed so the result shows when run as a script)
print(text_classifier.predict(['Terrible movie']))
And that’s it!