

Data Engineering & Data Science For Everyone | Python Code | PySpark | Machine Learning | Statistics | Predictive Analytics | Data Visualization | Time Series Analysis | NLP
Here we have a dataset from Kaggle; every record in it describes a female patient aged 21 or above. The task is to take that data and build a classification model which will determine, given some information about a patient, whether they are diabetic. In this post, we’re going to ingest and explore the dataset and […]
K Nearest Neighbours is one of the simplest machine learning models to explain. Let’s say we have lots of historical data with a class assigned. Below, you can see that we have a number of points in the red, blue and green classes. The idea then is, if we have a new point […]
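The excerpt stops before saying what happens to the new point, so the following is only a minimal sketch of the standard k-nearest-neighbours idea, not necessarily the code from the post: find the k labelled points closest to the new point and return the most common class among them. The function name `knn_predict`, the toy coordinates and the choice of k=3 are all illustrative assumptions.

```python
import math
from collections import Counter


def knn_predict(points, labels, new_point, k=3):
    """Return the majority class among the k points nearest to new_point.

    Hypothetical helper for illustration; uses straight-line (Euclidean) distance.
    """
    # Pair every historical point's distance to the new point with its class.
    distances = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(points, labels)
    )
    # Keep only the classes of the k closest points.
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent class among those neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy historical data (made up): three small clusters, one per class.
points = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5), (9, 1), (9, 2), (8, 1)]
labels = ["red"] * 3 + ["blue"] * 3 + ["green"] * 3

print(knn_predict(points, labels, (1.2, 1.1)))
print(knn_predict(points, labels, (5.1, 4.9)))
```

With real data, k and the distance measure are choices to tune rather than fixed values.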
A Naive Bayes algorithm is a classifier. It takes into account the probability of each feature occurring and determines the overall probability of each target class (outcome). From those, it takes the highest probability and returns that class as its prediction. It is called naive because it acts as though features don’t depend on one another […]
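As a rough illustration of the description above, here is a minimal sketch of a Naive Bayes classifier for categorical features. It multiplies the class prior by the per-feature probabilities (treating features as independent) and returns the class with the highest score. The function names, the toy weather data and the add-one smoothing, which assumes each feature has exactly two possible values, are assumptions made for this example and are not taken from the post.

```python
from collections import Counter, defaultdict


def train(rows, targets):
    """Count how often each class and each (class, feature, value) occurs."""
    class_counts = Counter(targets)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, target in zip(rows, targets):
        for index, value in enumerate(row):
            feature_counts[(target, index)][value] += 1
    return class_counts, feature_counts


def predict(model, row):
    """Return the class with the highest naive probability score."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, count in class_counts.items():
        score = count / total  # prior: P(class)
        for index, value in enumerate(row):
            seen = feature_counts[(cls, index)][value]
            # Add-one smoothing, assuming two possible values per feature,
            # so an unseen value never forces the whole product to zero.
            score *= (seen + 1) / (count + 2)
        scores[cls] = score
    return max(scores, key=scores.get)


# Toy data (made up): (outlook, windy) -> decision.
rows = [("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
        ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no")]
targets = ["play", "play", "play", "stay", "stay", "stay"]

model = train(rows, targets)
print(predict(model, ("sunny", "no")))
print(predict(model, ("rainy", "yes")))
```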
Support vector machines are supervised classification models in which each observation (or data point) is plotted in an N-dimensional space, where N is the number of features. If we had height and weight, we would have a two-dimensional graph. And if we were trying to classify those people as average or overweight, we would […]
In a traditional random forest, learning happens in parallel. Below, we can see that each tree samples data from the overall dataset and is trained on that sample. The trees are built in parallel and independently – no tree has any influence on any other. Gradient boosting seeks to improve on a weak […]
In this article, we’re going to cover which models you could use to predict continuous numeric values. You have a number of model choices – let’s discuss three: linear regression, random forest regressor and gradient boosting tree. These are big enough topics to have their own articles and indeed, they shall. But for now, let’s give […]
As with the previous sections in this series, there is a little overlap – but not a huge amount. The techniques we’re going to discuss are related to feature engineering and feature selection. Binning data: binning is a really useful technique. It’s a way to convert continuous variables into discrete variables by bucketing them in […]
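To make the idea of binning concrete, here is a minimal sketch that turns a continuous age into a discrete age band. The bucket edges, the labels and the helper name `bin_value` are illustrative assumptions, not values from the post.

```python
from bisect import bisect_right

# Assumed bucket boundaries: each edge is the lowest value of the next bucket.
AGE_EDGES = [30, 45, 60]
AGE_LABELS = ["21-29", "30-44", "45-59", "60+"]


def bin_value(value, edges, labels):
    """Map a continuous value to the label of the bucket it falls into."""
    # bisect_right counts how many edges are <= value, which is the bucket index.
    return labels[bisect_right(edges, value)]


ages = [22, 30, 44, 45, 61]
print([bin_value(age, AGE_EDGES, AGE_LABELS) for age in ages])
```

In a pandas workflow, `pandas.cut` performs the same kind of bucketing on a whole column.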
In the previous article, I highlighted the below phases as being part of the ML workflow: data cleansing and formatting; data exploration; feature engineering and feature selection; initial machine learning model implementation; comparison of different models; hyperparameter tuning on the best model; evaluation of the model accuracy on the testing data set; understanding the model. In […]
When we’re working through a data science problem, there are a few main steps which we need to take. These are outlined below: data cleansing and formatting; data exploration; feature engineering and feature selection; initial machine learning model implementation; comparison of different models; hyperparameter tuning on the best model; evaluation of the model accuracy […]
Detecting outliers in high-dimensional data is hard. There are so many observations across so many dimensions that plotting the data is often not possible and, even if you can plot it, interpreting it with the naked eye is extremely challenging. Before we get into how we may identify the outliers using an isolation forest, […]
As you may be aware, I am an online Python tutor and quite often I get asked pretty specific questions. This week, I was asked to show a simple way to find the distance in kilometres between two geographic locations. So, here it is – the easiest way to approach this problem is to […]
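The excerpt is cut off before naming the approach, so what follows is not necessarily the method used in the post. One common way to get the distance in kilometres between two latitude/longitude pairs is the haversine formula, sketched here with only the standard library. The function name and the mean Earth radius of 6371 km are assumptions for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates for London and Paris, used purely as an example.
print(round(haversine_km(51.5074, -0.1278, 48.8566, 2.3522), 1))
```

The formula treats the Earth as a perfect sphere, so results are approximations.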
The below script takes a UK postcode as input and checks that it matches a valid format. I have handled the below formats: X11XX, XX11XX and XX111XX. To do this, I use some regular expressions. Let’s look at one: “^[a-zA-Z]{1}[0-9]{2}[a-zA-Z]{2}”. Here: The beginning of the string (denoted with a ^) needs to be a letter. […]
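A minimal sketch of how such a check might look follows. Only the first pattern is quoted in the excerpt; the other two patterns are written by analogy for the listed formats and are assumptions, as are the function name, the removal of spaces before matching and the use of `re.fullmatch` to reject trailing characters.

```python
import re

# Pattern quoted in the excerpt (format X11XX).
PATTERN_X11XX = r"^[a-zA-Z]{1}[0-9]{2}[a-zA-Z]{2}"

# Hypothetical patterns for the other two listed formats, written by analogy.
PATTERN_XX11XX = r"^[a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2}"
PATTERN_XX111XX = r"^[a-zA-Z]{2}[0-9]{3}[a-zA-Z]{2}"

PATTERNS = (PATTERN_X11XX, PATTERN_XX11XX, PATTERN_XX111XX)


def is_valid_postcode(postcode):
    """Return True if the postcode matches one of the three handled formats."""
    # Assumption: spaces are ignored, so "M1 1AA" is treated like "M11AA".
    candidate = postcode.replace(" ", "")
    # fullmatch requires the entire string to match, not just its beginning.
    return any(re.fullmatch(pattern, candidate) for pattern in PATTERNS)


for example in ("M1 1AA", "SW1 1AA", "SW11 1AA", "12345"):
    print(example, is_valid_postcode(example))
```

This checks format only; it does not confirm that a postcode actually exists.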