ML Series: What On Earth Are Support Vector Machines?

Support vector machines are supervised classification models in which each observation (or data point) is plotted in an N-dimensional space, where N is the number of features.

If our features were height and weight, we would have a 2-dimensional graph. If we were trying to classify those people as average or overweight, we would draw a line like the one below, where points above the line are overweight and points below are average or OK (note: this is mock data and does not represent real over/under weight values).

[Figure: scatter chart]

But where do we draw the decision boundary? Well, think of the boundary line like a road. We have loads of traffic on the road, so we want it to be as wide as possible. We could draw the road like below, between the two groups. The group in blue would be ‘Overweight’ and the group in orange would be ‘OK’. Of course, if we have multiple categories, we will have multiple decision boundaries.

Or we could draw it like the below. This road is much wider and hence is the better split, because wider margins help to prevent overfitting. To clarify, we are looking for the widest margin: the one where the distance between the central line and the closest points on either side is as large as possible. We call this central line the hyperplane.
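To make the "widest road" idea concrete, here is a minimal sketch that measures the margin of two candidate boundary lines. The points, the two candidate lines and the helper names (`distance_to_line`, `margin`) are all made up for illustration, and the lines are hand-picked rather than found by an SVM.

```python
import math

# Mock 2-D points, invented for illustration only.
ok_group = [(1, 1), (2, 2), (3, 1)]
overweight_group = [(1, 4), (2, 5), (3, 4)]


def distance_to_line(point, w, b):
    """Perpendicular distance from a point to the line w[0]*x + w[1]*y + b = 0."""
    x, y = point
    return abs(w[0] * x + w[1] * y + b) / math.hypot(w[0], w[1])


def margin(w, b):
    """Distance from the line to its closest point: half the width of the road."""
    return min(distance_to_line(p, w, b) for p in ok_group + overweight_group)


# Two hand-picked horizontal lines that both split the groups correctly.
narrow_road = margin(w=(0.0, 1.0), b=-2.5)  # the line y = 2.5
wide_road = margin(w=(0.0, 1.0), b=-3.0)    # the line y = 3

print(narrow_road, wide_road)  # 0.5 1.0
```

Both lines separate the two groups, but the second leaves twice as much room on either side, so it is the one we would prefer. A real support vector machine searches over all possible lines for the widest margin instead of comparing two guesses.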

Does it even matter? Yes! Unoptimised decision boundaries could result in misclassification of new data, which would be bad!

So what exactly is a support vector? Well, a support vector is a data point (also known as a vector) that the margin pushes up against. In other words, support vectors are the points sitting right on the edges of the road.
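As a rough illustration, the sketch below picks out the support vectors for a boundary we already know. The points and the line y = 3 are mock values chosen by hand, and `distance_to_line` is a hypothetical helper. In practice, an SVM finds the boundary and its support vectors together during training.

```python
import math

# Mock 2-D points and a hand-picked boundary (the line y = 3), for illustration only.
points = [(1, 1), (2, 2), (3, 1), (1, 4), (2, 5), (3, 4)]
w, b = (0.0, 1.0), -3.0


def distance_to_line(point):
    """Perpendicular distance from a point to the line w[0]*x + w[1]*y + b = 0."""
    x, y = point
    return abs(w[0] * x + w[1] * y + b) / math.hypot(w[0], w[1])


# The support vectors are the points whose distance to the line equals the margin.
closest = min(distance_to_line(p) for p in points)
support_vectors = [p for p in points if math.isclose(distance_to_line(p), closest)]

print(support_vectors)  # [(2, 2), (1, 4), (3, 4)]
```

Only three of the six points touch the edge of the road. The other three could move further away from the boundary without changing where we would draw it.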

The support vector machine algorithm is pretty memory efficient, as its decision function only depends on the support vectors and hence uses a subset of the overall dataset.

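What if the groups cannot be split by a straight line? If we have a single dimension of data where one group sits in the middle of the other, we can run a function (known as a kernel) to map it into a higher dimension. For example, squaring each value turns the data into a 2-dimensional parabola, where a straight line can separate the groups.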
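Here is a small sketch of that idea using made-up 1-dimensional data. The values, labels, the `lift` and `one_cut_separates` helpers and the threshold of 2.5 are all assumptions chosen for illustration. It shows the mapping itself, not how an SVM library applies kernels internally.

```python
# Mock 1-D data: the "inner" group sits between the two halves of the "outer" group.
xs = [-3, -2, -1, 0, 1, 2, 3]
labels = ["outer", "outer", "inner", "inner", "inner", "outer", "outer"]


def lift(x):
    """Map a 1-D value onto the parabola (x, x squared)."""
    return (x, x * x)


def one_cut_separates(values, value_labels):
    """True if a single cut point puts each group entirely on one side."""
    ordered = [label for _, label in sorted(zip(values, value_labels))]
    # One cut is enough only if the label changes exactly once in sorted order.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes == 1


lifted = [lift(x) for x in xs]
heights = [point[1] for point in lifted]  # the new second coordinate, x squared

print(one_cut_separates(xs, labels))       # False: no single cut works in 1-D
print(one_cut_separates(heights, labels))  # True: one cut works after lifting

# In 2-D, the horizontal line at height 2.5 (hand-picked) splits the groups.
threshold = 2.5
predicted = ["outer" if point[1] > threshold else "inner" for point in lifted]
print(predicted == labels)  # True
```

The data itself has not changed. We have only added a second coordinate, and in that new space a straight line does the job.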


Advantages:
  • Effective in high dimensional spaces
  • Uses a subset of data points (the support vectors) in the decision function, so it’s memory efficient


Disadvantages:
  • Can perform poorly when the number of features is greater than the number of samples