Association rule mining is an unsupervised learning technique that finds patterns in our data: it tells us how likely Y is to occur given that X has occurred. In other words, it finds features (or dimensions) which frequently appear together. Note: just because someone buys burgers with ketchup, does not mean someone that buys […]
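
The "how likely Y occurs in the event of X" idea boils down to two measures, support and confidence. A minimal sketch with a hypothetical basket dataset (the items and counts are made up for illustration):

```python
# Hypothetical transaction data for illustration
transactions = [
    {"burger", "ketchup", "cola"},
    {"burger", "ketchup"},
    {"burger", "cola"},
    {"ketchup", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent): how likely Y occurs in the event of X."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# 2 of the 3 burger baskets also contain ketchup
print(confidence({"burger"}, {"ketchup"}, transactions))  # 0.666...
```

Real implementations (e.g. Apriori) prune the candidate itemsets, but the rule scoring is exactly this ratio.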

Read more

## How To Implement A Random Forest Machine Learning Model On An Imbalanced Dataset

For this article, I wanted to demonstrate an end-to-end model implementation, including tuning for an imbalanced dataset. First off, let’s ingest our data. Now, we can start to explore our dataset: are there any null values that we need to worry about? There are no nulls! A rare pleasure. Let’s now […]
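
The null check mentioned above is a one-liner in pandas. A small sketch with a toy stand-in for the ingested dataset (column names are illustrative, not the article's actual data):

```python
import pandas as pd

# Toy stand-in for the ingested dataset
df = pd.DataFrame({"glucose": [148, 85, 183], "bmi": [33.6, 26.6, 23.3]})

# Count nulls per column; all zeros means nothing to impute
null_counts = df.isnull().sum()
print(null_counts)
```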

Read more

## Methods For Increasing Model Accuracy (Dimensionality Reduction; Class Imbalance; Hyperparameters)

Tuning our machine learning model is of course critical to its accuracy. There are two methods for tuning models: tuning the data and tuning the parameters. We ‘tune’ the data to clean it up, select the right features and make sure we don’t have a class imbalance – because after all, if you pump garbage into […]

Read more

## The Four Key Accuracy Metrics To Prove Or Improve Our Machine Learning Model Performance

In the last article, we looked at the implemented algorithm and at the model score along with the confusion matrix. In this article, I am going to give a high-level overview of some of the methods we could use to evaluate our model performance. Classification Accuracy: Classification Accuracy is the […]
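
Classification accuracy is the simplest of these metrics: the proportion of predictions that match the true labels. A minimal sketch with made-up labels:

```python
def classification_accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Illustrative labels: 4 of 5 predictions are correct
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(classification_accuracy(y_true, y_pred))  # 0.8
```

Note that on an imbalanced dataset this single number can be misleading, which is why the confusion matrix and the other metrics matter.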

Read more

## Implementing A Logistic Regression Machine Learning Model On Our Diabetes Dataset

Following on from my previous article, we are continuing our journey to predicting whether someone has diabetes. The way we do this is pretty straightforward. First, we get our data ready to be ingested by the model. To do this, we create two new dataframes, y and X. The y dataframe stores the outcomes […]
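
The y/X split described above looks like this in pandas; the rows and column names here are illustrative stand-ins for the diabetes dataset:

```python
import pandas as pd

# Illustrative rows; the real dataset has more columns
data = pd.DataFrame({
    "glucose": [148, 85, 183],
    "bmi": [33.6, 26.6, 23.3],
    "outcome": [1, 0, 1],
})

y = data["outcome"]                  # the outcomes (has diabetes or not)
X = data.drop(columns=["outcome"])   # everything else becomes the features
```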

Read more

## Exploring Data Using Seaborn And Pandas As A Data Scientist

Here we have a dataset from Kaggle; all records within this dataset are for females aged 21 and above. The task is to take that data and build a classification model which will determine, given some information about a patient, whether they are diabetic. In this post, we’re going to ingest and explore the dataset and […]
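
A typical first exploration step in pandas is a summary of each column. A small sketch with made-up rows standing in for the Kaggle data:

```python
import pandas as pd

# Made-up rows standing in for the Kaggle dataset
df = pd.DataFrame({"age": [25, 31, 42], "glucose": [148, 85, 183]})

# count, mean, std, min, quartiles and max per numeric column
summary = df.describe()
print(summary)
```

From here, seaborn plots (distributions, pairwise relationships) build on the same dataframe.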

Read more

## Overview Of K Nearest Neighbours Machine Learning Algorithm For Aspiring Data Scientists

K Nearest Neighbours is one of the simplest machine learning models to explain. Let’s say we have lots of historical data with a class assigned. In the below, you can see that we have a number of points in the red, blue and green classes. The idea then is, if we have a new point […]
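
The idea fits in a few lines: find the k closest labelled points and take a majority vote. A minimal sketch with toy red/blue points (the coordinates are invented for illustration):

```python
from collections import Counter
import math

# Toy labelled points: (x, y) coordinates with a class, as in the example
points = [((1, 1), "red"), ((1, 2), "red"), ((5, 5), "blue"), ((6, 5), "blue")]

def knn_predict(new_point, points, k=3):
    """Classify new_point by majority vote among its k nearest neighbours."""
    nearest = sorted(points, key=lambda p: math.dist(new_point, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.5, 1.5), points))  # red
```

In practice you would use a vectorised implementation (e.g. scikit-learn's KNeighborsClassifier), but the logic is exactly this.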

Read more

## How Does The Naive Bayes Machine Learning Algorithm Work?

A Naive Bayes algorithm is a classifier. It takes into account the probability of each feature occurring and determines the overall probability of the target class (outcome). From that, it takes the highest probability and returns that as its prediction. The reason it’s naive is that it acts as though features don’t depend on one another […]
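
That "multiply the feature probabilities, take the highest" step can be sketched directly. The priors and conditional probabilities below are invented numbers for a hypothetical spam/ham example, not estimates from real data:

```python
# Invented probabilities for a hypothetical spam filter
p_class = {"spam": 0.4, "ham": 0.6}
p_word_given_class = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham":  {"offer": 0.2, "meeting": 0.6},
}

def naive_bayes_predict(words):
    """Multiply per-feature probabilities (the 'naive' independence
    assumption) for each class, then return the class with the highest score."""
    scores = {}
    for c, prior in p_class.items():
        score = prior
        for w in words:
            score *= p_word_given_class[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

print(naive_bayes_predict(["offer"]))  # spam: 0.4*0.7 = 0.28 beats ham: 0.6*0.2 = 0.12
```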

Read more

## What On Earth Is The Support Vector Machine (SVM) ML Model?

Support vector machines are supervised classification models, within which each observation (or data point) is plotted in N-dimensional space, where N is the number of features. If we had height and weight, we would have a 2-dimensional graph. And if we were trying to classify those people as average or overweight, we would […]
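
What a fitted linear SVM ultimately gives you is a separating hyperplane w·x + b = 0: points are classified by which side they fall on. The weights below are made-up values for the height/weight example, not a trained model:

```python
# Each person is a point in 2-D feature space: (height_cm, weight_kg).
# w and b are invented for illustration; an SVM would learn them by
# maximising the margin between the two classes.
w = (-0.5, 1.0)
b = -20.0

def classify(point):
    """Sign of w·x + b decides which side of the hyperplane the point is on."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "overweight" if score > 0 else "average"

print(classify((170, 110)))  # overweight
print(classify((170, 65)))   # average
```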

Read more

## Boost Your Random Forest Machine Learning Model Accuracy With Gradient Boosted Machines

In a traditional random forest, there is parallel learning. In the below, we can see that each model samples data from the overall dataset and produces a model from it. It does this in parallel and independently – no model has any influence on any other model. Gradient boosting seeks to improve on a weak […]
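
The sequential part of gradient boosting can be sketched in a few lines: each round, a weak learner fits the residual left by the ensemble so far. For illustration the "weak learner" here is just the mean of the residuals, the simplest possible model:

```python
# Gradient boosting sketch on a toy regression target
y = [3.0, 5.0, 7.0]

def boost(y, n_rounds=3, learning_rate=0.5):
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        correction = sum(residuals) / len(residuals)  # the 'weak learner'
        pred = [p + learning_rate * correction for p in pred]
    return pred

print(boost(y))  # each round shrinks the residual; predictions approach the mean of y
```

A real GBM uses shallow decision trees as the weak learners, so corrections differ per region of the feature space rather than being a single constant, but the fit-the-residual loop is the same.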

Read more

## My Three Favourite Supervised Regression Machine Learning Model Options

In this article, we’re going to cover what models you could use to predict continuous numeric values. You have a number of model choices – let’s discuss three: Linear regression Random forest regressor Gradient Boosting Tree These are big enough topics to have their own articles and indeed, they shall. But for now, let’s give […]

Read more

## Three Methods For Feature Engineering and Selection For Data Engineers

As with the previous sections in this series, there is a little overlap – but not a huge amount. The techniques we’re going to discuss are related to feature engineering and feature selection. Binning data Binning is a really useful technique. It’s a way to convert continuous variables into discrete variables by bucketing them in […]
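
Binning in pandas is usually done with `pd.cut`. A small sketch with invented ages and bucket edges:

```python
import pandas as pd

# Invented ages; bucket a continuous variable into discrete groups
ages = pd.Series([21, 25, 34, 51])
bins = pd.cut(ages, bins=[20, 30, 40, 60], labels=["20s", "30s", "40+"])
print(bins.tolist())  # ['20s', '20s', '30s', '40+']
```

The resulting categorical column can then be one-hot encoded or fed straight into tree-based models.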

Read more