ML

Using PySpark & SKLearn to deploy a machine learning model

Recently, I’ve been working to deploy a new machine learning model into a production environment. This is the first time I’ve had to deploy a model that runs across such huge datasets: the requirement is to make 30,000,000 predictions each time the model runs. The pipeline has three distinct phases. The first, […]

ML

An end to end Random Forest Classifier implementation guide

In this article, we are going to go through some of the key steps in implementing a random forest machine learning model. The data we will be using is from Kaggle – it’s a dataset describing whether or not patients have diabetes. Luckily, this dataset is nice and clean, so we don’t need to worry […]

ML, Python

Expose your ML model via a simple API in Python

As data scientists, it is important that we have a method of sharing the insight from our models. In this post, I am going to show you how to create a very simple API through which a customer can pass URL parameters to retrieve data generated by a Python function. Before we get started, make sure […]
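The idea of passing URL parameters to a Python function can be sketched with the standard library alone. This is a minimal illustration, not the approach from the post: the `score` function, the `value` parameter and the example URL are all hypothetical, and no web server is started here.

```python
import json
from urllib.parse import parse_qs, urlparse


def score(value: float) -> dict:
    """Hypothetical stand-in for a model: simply doubles the input."""
    return {"input": value, "prediction": value * 2}


def handle(url: str) -> str:
    """Read the 'value' URL parameter and return the function's output as JSON."""
    # parse_qs maps each parameter name to a list of its values
    params = parse_qs(urlparse(url).query)
    if "value" not in params:
        return json.dumps({"error": "missing 'value' parameter"})
    try:
        value = float(params["value"][0])
    except ValueError:
        return json.dumps({"error": "'value' must be a number"})
    return json.dumps(score(value))


# Hypothetical request URL, for illustration only
response = handle("http://localhost:5000/predict?value=2.5")
print(response)
```

In a real deployment a web framework would do the URL parsing and routing; the sketch only shows the parameter-to-function step.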

Data, hive, ML

Working with dates in Apache Hive

Working with dates is one of those tedious things we frequently come across as data engineers. The frustration is that there are simply tonnes of date formats. Let’s list a few:

Format | Example
MM/dd/yy | 11/01/21
dd/MM/yy | 01/11/21
yy/MM/dd | 21/11/01
d/MM/yy | 1/11/21 (no leading zeros)
MMddyy | 110121
ddMMyy | 011121
yyyyMMdd | 20211101
yyyy-MM-dd HH:mm:ss.SSS | 2021-11-01 10:45:12.084
yyyy-MM-dd […]
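To see why the number of formats is a problem, here is a small Python sketch (not Hive code) that parses the same string under two of the formats listed. The `%m/%d/%y` and `%d/%m/%y` directives are Python's rough equivalents of the `MM/dd/yy` and `dd/MM/yy` patterns; that mapping is an assumption made for this illustration.

```python
from datetime import datetime

raw = "11/01/21"

# Python strptime directives standing in for the Java-style patterns
# (the mapping is an assumption of this sketch, not Hive syntax)
as_month_first = datetime.strptime(raw, "%m/%d/%y")  # MM/dd/yy
as_day_first = datetime.strptime(raw, "%d/%m/%y")    # dd/MM/yy

print(as_month_first.date())  # 2021-11-01
print(as_day_first.date())    # 2021-01-11
```

The same input yields two different dates, so the format has to be known before the value can be trusted.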

ML

An introduction to timeseries models (AR, MA, ARMA and ARIMA)

Timeseries forecasting is quite a big topic to cover. I’ve covered key terminology and exponential smoothing in this article, and how we might remove timeseries outliers here. In this post, I am going to discuss the different components of the ARIMA model (AR and MA), in addition to the ARIMA model […]

ML

MAE | RMSE | MAPE : Measures of model accuracy for data scientists

Mean Absolute Error (MAE): This takes the difference between the predicted value and the actual value for every prediction, converts each difference to its absolute value (which means it makes all the values positive) so that positive and negative errors do not cancel one another out, and then takes the average of the results. Let’s consider an example. In the below, […]
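A minimal sketch of the MAE calculation described above, in plain Python. The sample values are made up for illustration and are not taken from the post.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    # abs() stops positive and negative errors cancelling one another out
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, for illustration only
actual = [10, 20, 30]
predicted = [12, 18, 33]

mae = mean_absolute_error(actual, predicted)
print(mae)
```

Here the individual errors are -2, 2 and -3. A plain average of those would be -1, which understates the error; the absolute values 2, 2 and 3 average to 7/3.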

ML

Data Badass Early Preview

An early preview of my new book ‘Data Badass’ is available using this link. It has not yet been thoroughly edited and will be added to over time, but I am keen to gather some feedback. It’s a book that covers the data basics; data platforms (including Hadoop, Kafka, Flume, Hive, Spark) and […]

ML

Using ROC Curves & AUC

This is a snippet from my upcoming book ‘Data Badass’ (pictured below): The ROC curve and the Area Under the Curve (AUC) are used for binary classification problems. The ROC curve plots the True Positive Rate against the False Positive Rate. Ideally, you want to reduce the number of false positives as much as […]
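As a rough sketch of the two rates the ROC curve compares, the function below computes the True Positive Rate and False Positive Rate at a single threshold. The labels, scores and threshold are hypothetical, and the function name is my own; a full ROC curve repeats this calculation across many thresholds.

```python
def tpr_fpr(labels, scores, threshold):
    """Return (true positive rate, false positive rate) at one threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    # Guard against division by zero when a class is absent
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical labels (1 = positive) and model scores, for illustration only
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

point = tpr_fpr(labels, scores, threshold=0.5)
print(point)
```

Each threshold gives one point on the ROC curve; lowering the threshold moves the point towards (1.0, 1.0) on the chart.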

Data, ML

The data scientist learning plan for 2021

When you look online at what it takes to become a data scientist, it’s enough to make your brain melt. You have people telling you that you need to be an expert statistician / mathematician; you need to be a top-level coder in 15 different languages; be proficient with every type of SQL/NoSQL database on […]

ML

Can we successfully implement Agile in data science?

Agile is about iterative development and delivering tangible products/features quickly, which provides the business with value and ROI faster than a traditional waterfall project. Consider the example of a piece of accounting software. Overall, it’s going to have 50 features to support the accounts team. To deliver all of the features in a waterfall fashion, […]
