Month: May 2020
24 posts
Validating error estimation of our models with Cross Validation
Let’s say we have the dataset below. It’s a list of patients, with some health KPIs and a…
Dimensionality Reduction for Machine Learning using Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a linear transformation algorithm that sets out to simplify complex, high-dimensional data…
Machine Learning Data Prep: Standardization & Encoding
There are so many things to consider when getting your data ready for machine learning models. In this…
Fault Tolerant Friday: Creating robust pipelines in Python
When you’re dealing with tonnes of different data sources, issues crop up all the time: the data source hasn’t been…
Tabular Thursday: compare this period vs last period
Tabular Thursdays are all about doing cool things with our tabular data. This week, we’re going to look…
Wild Wednesday: handling semi-structured JSON data
Wild Wednesday posts are all about taming semi-structured or unstructured data. Today, we’re going to look at ingesting…
Toolset Tuesday: Plotting data from your Pandas Dataframes
In this Toolset Tuesday post, we’re going to look at some simple plotting in Plotly. Why? Because the…
ML Monday: Linear Regression Overview
In the ML Monday series, I aim to go through some of the theory behind machine learning &…
Starter Sundays: PySpark Basics, Selecting & Filtering Data
Welcome to this week’s Starter Sunday – which is where I cover off some basic concepts of Spark,…
Spark Saturdays: Spark Structured Streaming, Data Sources and Sinks
In Spark Streaming, we have the concept of sources and sinks. A source is very much what you…