ML

Using ROC Curves & AUC

This is a snippet from my upcoming book ‘Data Badass’ (pictured below): The ROC curve & the Area Under the Curve (AUC) are used for binary classification problems. The ROC curve plots the True Positive Rate against the False Positive Rate. Ideally, you want to reduce the number of false positives as much as […]
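
For a feel of how this looks in practice, here's a minimal sketch using scikit-learn (the generated dataset and logistic regression model are placeholders of mine, not the book's example):

```python
# Minimal ROC curve + AUC sketch; the data and model are stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr)              # the ROC curve itself
plt.plot([0, 1], [0, 1], "--")  # a random-guess baseline for comparison
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()
```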

Read more
ML

K-Means Clustering Explained

K-Means clustering is an unsupervised machine learning model. Unsupervised learning is where we do not provide the model with the actual outputs from the data. Unsupervised learning aims to model the underlying structure or distribution in the data to learn more about the data. The most popular use cases of unsupervised learning are association rules – […]
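
As a quick sketch of what that looks like in scikit-learn (the blob data below is a toy placeholder), note that we only ever pass the features in, never any labels:

```python
# Minimal K-Means sketch: fit on features only, since learning is unsupervised.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # the labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)  # X only, no y
print(kmeans.labels_[:10])      # cluster assignment for the first 10 points
print(kmeans.cluster_centers_)  # the learned centroids
```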

Read more
ML, Python

Working with date time features in your ML models

Having a date in our dataset can really improve our predictions. Imagine an ice cream company: its data has a seasonal aspect (more sales in summer), so predicting sales without dates in our data would be inaccurate. The problem, though, is that dates are horrible to work with. The first of June 2020 could be […]
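
As a rough sketch of the kind of decomposition this builds towards (the toy DataFrame below is mine, not the article's), pandas makes it easy to break a raw date into numeric features a model can actually use:

```python
# Turning a raw date column into model-friendly numeric features with pandas.
import pandas as pd

df = pd.DataFrame({"sale_date": ["2020-06-01", "2020-12-24"], "sales": [480, 95]})
df["sale_date"] = pd.to_datetime(df["sale_date"])

df["year"] = df["sale_date"].dt.year
df["month"] = df["sale_date"].dt.month          # captures the seasonal aspect
df["dayofweek"] = df["sale_date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["dayofweek"] >= 5
print(df)
```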

Read more
ML

ML Scenario Part 3: Reducing dimensionality to improve accuracy

Reducing dimensionality can sometimes help to drastically improve your model accuracy. In the case of our sample dataset, we don’t really expect it to – as there are not many features in the first place & there aren’t any that are particularly redundant. In the below, you can see that none of our models […]
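
For context, one common way to reduce dimensionality is PCA; this is a sketch of my own (the article may well use a different technique or dataset):

```python
# Minimal PCA sketch: project the features down to fewer components.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)             # placeholder dataset
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=5).fit(X_scaled)
X_reduced = pca.transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))
```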

Read more
ML

How does machine learning work?

Machine learning uses statistical techniques to give computer systems the ability to ‘learn’ rather than being explicitly programmed. By learning from historical inputs, we’re able to achieve far greater accuracy in our predictions & constantly refine the model with new data. What parts comprise a machine learning model? A machine learning model can be drawn […]
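
The ‘learn from historical inputs’ idea fits in a few lines of code (a toy sketch of mine, not the article's diagram):

```python
# Fit on past data, predict on new data we never explicitly programmed for.
from sklearn.linear_model import LinearRegression

history_X = [[1], [2], [3], [4]]  # historical inputs (e.g. months)
history_y = [10, 20, 30, 40]      # historical outputs (e.g. sales)

model = LinearRegression().fit(history_X, history_y)  # the 'learning' step
print(model.predict([[5]]))  # predict an unseen case from the learned pattern
```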

Read more
ML

What are the types of machine learning?

Supervised learning is where we provide the model with the actual outputs from the data. This lets it build a picture of the data and form links between the historic parameters (or features) that have influenced the output. To put a formula onto supervised learning, it would be as below, where Y is […]
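
To make the split concrete in code (my own sketch, in scikit-learn terms): a supervised model is fitted on both the features and the known outputs, while an unsupervised one sees the features alone.

```python
# The supervised vs. unsupervised distinction, in scikit-learn terms.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = [[0, 0], [1, 1], [9, 9], [10, 10]]
y = [0, 0, 1, 1]  # known outputs: only supervised learning sees these

LogisticRegression().fit(X, y)           # supervised: features AND outputs
KMeans(n_clusters=2, n_init=10).fit(X)   # unsupervised: features only
```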

Read more
ML

So, what is a Random Forest?

A random forest is a grouping of decision trees. Data is run through each tree in the forest, and each tree provides a classification. It’s important that we do this because, while decision trees are great & easy to read, they’re inaccurate. They’re good with their training data, but aren’t flexible enough to accurately predict outcomes for […]
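
A minimal scikit-learn sketch of the idea (generated placeholder data, not the article's): many trees each vote, and the ensemble takes the majority.

```python
# Random forest sketch: an ensemble of decision trees voting together.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the ensemble as a whole
print(forest.estimators_[0])         # each estimator is an individual decision tree
```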

Read more
ML

Simple Machine Learning Scenario: predicting quality

In this article, I am going to walk through a really simple machine learning scenario, using data from Kaggle. In future articles, I will look at tuning this model to improve its accuracy and see how different algorithms produce differing results. The purpose of this article is to give an introduction to a machine learning model with […]
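
The bare bones of such a scenario are just split, fit, score; the dataset and model below are stand-ins of mine, not the Kaggle data from the article:

```python
# Minimal first-scenario skeleton: split the data, fit a model, score it.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```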

Read more
ML

Using GridSearch for Hyperparameter Tuning

Hyperparameter tuning can be quite the task. It can be heavily manual, relying on gut feel as to what the value for a given parameter should roughly be. If we did it in a completely manual way, I imagine that depression rates among data scientists would go through the roof, but luckily we have GridSearch. […]
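
In scikit-learn this is GridSearchCV; a minimal sketch looks like the below (the model and parameter grid here are illustrative choices of mine, not recommendations):

```python
# GridSearchCV sketch: exhaustively try every parameter combination with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)  # tries every combination so you don't have to guess

print(search.best_params_)
print(search.best_score_)
```

By default GridSearchCV refits the best combination on the full data, so `search.predict(...)` afterwards uses the winning parameters.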

Read more
ML

Validating error estimation of our models with Cross Validation

Let’s say we have the below dataset. It’s a list of patients, with some health KPIs and a boolean field which says whether they have diabetes or not. If you had thousands of patients, you would probably find some trends in this data which would allow you to predict whether someone had diabetes, based on […]
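
A minimal sketch of cross-validation with scikit-learn (generated data standing in for the patient table): instead of trusting one train/test split, we score the model on several different splits.

```python
# Cross-validation sketch: score the model on 5 different train/test folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)  # stand-in for the patient data

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # a more honest error estimate than a single split
```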

Read more
ML

Machine Learning Data Prep: Standardization & Encoding

There are so many things to consider when getting your data ready for machine learning models. In this article, I am going to cover encoding and standardization (or scaling) of our data. Standardization / scaling to remove or reduce bias: scaling is a method used to standardise the range of data. This is important as, if […]
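
A minimal sketch of both steps with scikit-learn (the toy arrays below are my own, not the article's dataset):

```python
# Scale numeric columns and one-hot encode a categorical one.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = np.array([[170.0, 65.0], [185.0, 90.0], [160.0, 50.0]])  # e.g. height, weight
categorical = np.array([["red"], ["green"], ["red"]])

scaled = StandardScaler().fit_transform(numeric)  # each column: mean 0, std dev 1
encoded = OneHotEncoder().fit_transform(categorical).toarray()  # one column per category
print(scaled)
print(encoded)
```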

Read more