For this article, I wanted to demonstrate an end-to-end model implementation, including tuning for an imbalanced dataset. First off, let’s ingest our data. Now we can start to explore it: are there any null values that we need to worry about? There are none – a rare pleasure! Let’s now […]
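A minimal sketch of that null check, assuming pandas and a small hypothetical stand-in for the dataset (the real CSV and its column names aren't shown here):

```python
import pandas as pd

# Hypothetical stand-in for the diabetes dataset (the real file isn't loaded here)
df = pd.DataFrame({
    "glucose": [148, 85, 183],
    "bmi": [33.6, 26.6, 23.3],
    "outcome": [1, 0, 1],
})

# Count nulls per column; all zeros means there are no missing values to handle
null_counts = df.isnull().sum()
print(null_counts)
```

With a real dataset you would replace the literal DataFrame with `pd.read_csv(...)` and run the same check.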

## ML Series: Methods For Increasing Model Accuracy

Tuning our machine learning model is, of course, critical to its accuracy. There are two methods for tuning models: tuning the data and tuning the parameters. We ‘tune’ the data to clean it up, select the right features and make sure we don’t have a class imbalance – because, after all, if you pump garbage into […]
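A quick sketch of checking for a class imbalance and one simple way to address it, using pandas and hypothetical labels (the counts here are invented for illustration):

```python
import pandas as pd

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = pd.Series([0] * 90 + [1] * 10)

# Always check the class balance before training
proportions = y.value_counts(normalize=True)
print(proportions)  # 0 -> 0.9, 1 -> 0.1: heavily imbalanced

# One common fix: oversample the minority class until the classes are even
minority = y[y == 1]
oversampled = pd.concat([y, minority.sample(80, replace=True, random_state=0)])
print(oversampled.value_counts())  # 90 of each class
```

Undersampling the majority class, or passing class weights to the model, are alternative fixes with the same goal.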

## ML Series: Accuracy Metrics To Prove Or Improve Our Diabetes Algorithm

In the last article, we started to look at the implemented algorithm and looked at the model score along with the confusion matrix. In this article, I am going to give a high-level overview of some of the methods we could use to evaluate our model performance. Classification accuracy is the […]
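A small sketch of computing classification accuracy and a confusion matrix, assuming scikit-learn and hypothetical labels and predictions (the real model's outputs aren't shown in this excerpt):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Classification accuracy: the fraction of predictions that match the truth
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.75

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1], [1 3]]
```

The off-diagonal cells (the false positives and false negatives) are what accuracy alone hides, which is why the confusion matrix matters for imbalanced problems.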

## ML Series: Predicting Diabetes; Initial Model Implementation & Confusion Matrix

Following on from my previous article, we are continuing our journey to predicting whether someone has diabetes. The way we do this is pretty straightforward. First, we get our data ready to be ingested by the model. To do this, we create two new dataframes, y and X. The y dataframe stores the outcomes […]
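A minimal sketch of that y/X split, using a hypothetical stand-in dataframe (the column names here are invented; the real dataset's outcome column may be named differently):

```python
import pandas as pd

# Hypothetical stand-in for the diabetes dataframe
df = pd.DataFrame({
    "glucose": [148, 85, 183, 89],
    "bmi": [33.6, 26.6, 23.3, 28.1],
    "outcome": [1, 0, 1, 0],
})

# y holds the outcomes we want to predict; X holds every feature column
y = df["outcome"]
X = df.drop(columns=["outcome"])
print(X.columns.tolist())  # ['glucose', 'bmi']
```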

## ML Series: Predicting Diabetes; Exploring Data

Here we have a dataset from Kaggle; all records within this dataset are for females aged 21 and above. The task is to take that data and build a classification model which will determine, given some information about a patient, whether they are diabetic. In this post, we’re going to ingest and explore the dataset and […]

## ML Series: K-Nearest Neighbours Described

K-nearest neighbours is one of the simplest machine learning models to explain. Let’s say we have lots of historical data with a class assigned. In the below, you can see that we have a number of points in the red, blue and green classes. The idea then is, if we have a new point […]
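A minimal sketch of that idea, assuming scikit-learn and a handful of hypothetical 2-D points in the red, blue and green classes:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points, each labelled red, blue or green
points = [[1, 1], [1, 2], [8, 8], [8, 9], [1, 9], [2, 9]]
labels = ["red", "red", "blue", "blue", "green", "green"]

# A new point is classified by majority vote among its k nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(points, labels)
print(knn.predict([[1.5, 1.5]]))  # ['red']
```

The choice of k matters: too small and the model is noisy; too large and distant classes start to outvote the local neighbourhood.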

## ML Series: An Introduction To Naive Bayes

A Naive Bayes algorithm is a classifier. It takes into account the probability of each feature occurring and determines the overall probability of the target class (outcome). From that, it takes the highest probability and returns that as its prediction. The reason it’s naive is that it acts as though features don’t depend on one another […]
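A small sketch of the idea, assuming scikit-learn's Gaussian variant and hypothetical two-feature data:

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data: each row is [feature1, feature2]
X = [[1.0, 2.0], [1.2, 1.9], [8.0, 9.0], [8.2, 9.1]]
y = [0, 0, 1, 1]

# Gaussian Naive Bayes treats each feature as independent given the class,
# multiplies the per-feature likelihoods, and predicts the most probable class
nb = GaussianNB()
nb.fit(X, y)
print(nb.predict([[1.1, 2.1]]))  # [0]
```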

## ML Series: What On Earth Are Support Vector Machines?

Support vector machines are supervised classification models, within which each observation (or data point) is plotted in an N-dimensional space, where N is the number of features. If we had height and weight, we would have a 2-dimensional graph. And if we were trying to classify those people as average or overweight, we would […]
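A minimal sketch of that height/weight example, assuming scikit-learn and hypothetical measurements (the numbers are invented for illustration):

```python
from sklearn.svm import SVC

# Hypothetical observations: [height (cm), weight (kg)] -> a 2-dimensional space
X = [[160, 55], [165, 60], [170, 95], [175, 100]]
y = ["average", "average", "overweight", "overweight"]

# A linear SVM finds the separating boundary with the widest possible margin
svm = SVC(kernel="linear")
svm.fit(X, y)
print(svm.predict([[162, 58]]))  # ['average']
```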

## ML Series: A Look At Gradient Boosted Trees

In a traditional random forest, there is parallel learning. In the below, we can see that each learner samples data from the overall dataset and produces a model from it. It does this in parallel and independently – no model has any influence on any other model. Gradient boosting seeks to improve on a weak […]
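A quick sketch of the contrast, assuming scikit-learn and a toy one-feature dataset (invented for illustration): the random forest builds its trees independently on bootstrap samples, while gradient boosting builds them sequentially, each tree correcting the errors of the ensemble so far.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Hypothetical toy data: one feature, cleanly separable classes
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Random forest: trees are fitted independently, in parallel, on bootstrap samples
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Gradient boosting: trees are fitted sequentially, each one correcting the last
gb = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

print(rf.predict([[1], [6]]))
print(gb.predict([[1], [6]]))
```

Both APIs look identical from the outside; the difference is entirely in how the trees are grown.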

## ML Series: Supervised Regression Model Options

In this article, we’re going to cover what models you could use to predict continuous numeric values. You have a number of model choices – let’s discuss three: linear regression, the random forest regressor and gradient boosted trees. These are big enough topics to have their own articles and indeed, they shall. But for now, let’s give […]
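As a taster, here is a minimal sketch of the first of those three, assuming scikit-learn and a hypothetical continuous target (the data follows y = 2x + 1, invented for illustration):

```python
from sklearn.linear_model import LinearRegression

# Hypothetical continuous target following y = 2x + 1
X = [[0], [1], [2], [3]]
y = [1, 3, 5, 7]

# Linear regression fits the straight line that minimises squared error
model = LinearRegression().fit(X, y)
print(model.predict([[4]]))  # ~[9.]
```

The random forest regressor and gradient boosted trees expose the same `fit`/`predict` interface, so swapping models is a one-line change.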

## ML Series: Feature Engineering and Selection

As with the previous sections in this series, there is a little overlap – but not a huge amount. The techniques we’re going to discuss are related to feature engineering and feature selection. Binning is a really useful technique. It’s a way to convert continuous variables into discrete variables by bucketing them in […]
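A minimal sketch of binning with pandas, using hypothetical ages and invented bucket boundaries:

```python
import pandas as pd

# Hypothetical continuous ages, bucketed into discrete labelled bins
ages = pd.Series([22, 35, 47, 61, 78])
bins = pd.cut(ages, bins=[0, 30, 50, 100], labels=["young", "middle", "older"])
print(bins.tolist())  # ['young', 'middle', 'middle', 'older', 'older']
```

`pd.qcut` is the quantile-based alternative when you want equal-sized buckets rather than fixed boundaries.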

## ML Series: Exploring Our Data

In the previous article, I highlighted the below phases as being part of the ML workflow:

- Data cleansing and formatting
- Data exploration
- Feature engineering and feature selection
- Initial machine learning model implementation
- Comparison of different models
- Hyperparameter tuning on the best model
- Evaluation of the model accuracy on the testing data set
- Understanding the model

In […]
