Python

Extracting data from the YouTube API

In this article, we’re going to look at extracting data from the YouTube Data API. For this, we need to set up an API key, which you can get via the Google Developer Console. Now, you need to pick a channel to pull data from. In this case, I have taken the rapper Hopsin’s channel. To […]
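As a taste of what the full post covers, here’s a minimal sketch of building a request URL for the YouTube Data API v3 channels endpoint. The channel ID and API key below are placeholders, not real values:

```python
from urllib.parse import urlencode

def build_channel_stats_url(channel_id: str, api_key: str, part: str = "statistics") -> str:
    """Build a YouTube Data API v3 URL asking for a channel's statistics."""
    base = "https://www.googleapis.com/youtube/v3/channels"
    return f"{base}?{urlencode({'part': part, 'id': channel_id, 'key': api_key})}"

# placeholders: substitute a real channel ID and your key from the Developer Console
url = build_channel_stats_url("UCxxxxxxxx", "YOUR_API_KEY")
```

You would then GET that URL (e.g. with `requests`) and parse the JSON response for subscriber and view counts.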

Read more
ML

K-Means Clustering Explained

K-Means clustering is an unsupervised machine learning model. Unsupervised learning is where we do not provide the model with the actual outputs from the data. Unsupervised learning aims to model the underlying structure or distribution in the data to learn more about the data. The most popular use cases of unsupervised learning are association rules – […]
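To give a flavour of the algorithm the full post explains, here’s a from-scratch K-Means sketch in NumPy (the toy data is made up for illustration): assign each point to its nearest centroid, recompute centroids, repeat until stable.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# two well-separated toy blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
```

In practice you’d reach for `sklearn.cluster.KMeans`, which adds smarter initialisation and multiple restarts.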

Read more
Python

A simple intro to sentiment analysis in Python

Sentiment analysis can provide key insight into the feelings of your customers towards your company & hence is becoming an increasingly important part of data analysis. Building a machine learning model to identify positive and negative sentiments is pretty complex, but luckily for us, there is a Python library that can help us out. It’s […]
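The library the post goes on to name is cut off in this excerpt (TextBlob and VADER are common choices), so here is only a toy lexicon-based scorer to illustrate the idea of classifying positive vs negative text:

```python
# Toy sentiment scorer for illustration only – real work would use a trained
# model or an established library, not a hand-picked word list.
POSITIVE = {"great", "love", "excellent", "good", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment(text: str) -> str:
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A word list like this misses negation ("not good") and sarcasm, which is exactly why library-backed models are worth using.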

Read more
Python

Statistical methods of handling outliers

Outliers can be a real pain & can skew our datasets. In the below, I’ve added an obvious outlier on row 3 (company name = Talane). If we were to take an average of InvoiceValue, this outlier would skew the result massively. Extract values within x standard deviations from the mean. Statistically, 99.7% of normally […]
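The post’s actual table isn’t shown in this excerpt, so here’s a sketch with made-up invoice data (19 normal rows plus a Talane-style outlier) showing the standard-deviation filter it describes:

```python
import pandas as pd

# illustrative data: 19 invoices around 100, plus one obvious outlier
df = pd.DataFrame({
    "CompanyName": [f"Company{i}" for i in range(19)] + ["Talane"],
    "InvoiceValue": [100.0 + i for i in range(19)] + [50000.0],
})

# keep only rows within 3 standard deviations of the mean
mean = df["InvoiceValue"].mean()
std = df["InvoiceValue"].std()
filtered = df[(df["InvoiceValue"] - mean).abs() <= 3 * std]
```

Note that on very small samples a single outlier inflates the standard deviation enough that 3σ may fail to catch it, which is why this sketch uses 20 rows.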

Read more
ML

Machine Learning Data Prep: Standardization & Encoding

There are so many things to consider when getting your data ready for machine learning models. In this article, I am going to cover encoding and standardizing (or scaling) our data. Standardization / scaling to remove / reduce bias: scaling is a method used to standardise the range of data. This is important as if […]
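As a quick sketch of both techniques the post covers (with made-up data), z-score scaling standardises a numeric column to mean 0 and standard deviation 1, while one-hot encoding turns a categorical column into binary columns:

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [30000.0, 45000.0, 60000.0, 75000.0],
    "department": ["sales", "tech", "tech", "hr"],
})

# standardise: (x - mean) / std gives mean 0, std 1
df["salary_scaled"] = (df["salary"] - df["salary"].mean()) / df["salary"].std()

# one-hot encode the categorical column into department_* binary columns
encoded = pd.get_dummies(df, columns=["department"])
```

In an sklearn pipeline you’d typically use `StandardScaler` and `OneHotEncoder` instead, so the same transform can be applied to unseen data.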

Read more
Spark

Fault Tolerant Friday: Creating robust pipelines in Python

When you’re dealing with tonnes of different datasources, issues crop up all the time: the datasource hasn’t been populated; the schema has changed; the schedule changes or the file format changes. The key with our jobs is to build in enough checks to at the very least warn us if something fails, so we can […]
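A minimal sketch of the "warn rather than crash" idea (the expected schema below is hypothetical, invented for the example): check the source exists and its columns match before the job runs.

```python
import logging
import os

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")

EXPECTED_COLUMNS = {"id", "name", "value"}  # hypothetical schema for this sketch

def validate_source(path: str, header: list) -> bool:
    """Warn (rather than crash) when a datasource is missing or its schema changed."""
    if not os.path.exists(path):
        log.warning("datasource missing: %s", path)
        return False
    missing = EXPECTED_COLUMNS - set(header)
    if missing:
        log.warning("schema changed, missing columns: %s", sorted(missing))
        return False
    return True
```

Running every source through a check like this before the heavy lifting means a broken feed produces a warning you can act on, not a cryptic failure halfway through the job.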

Read more
Tableau

Tabular Thursday: compare this period vs last period

Tabular Thursdays are all about doing cool things with our tabular data. This week, we’re going to look at a little function in Tableau, which is really useful. That function is the lookup function. Let’s say you have the below data. In the lookup minus 1 column, we’re taking the value from the row before […]
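The post’s own example is in Tableau, but the same "lookup minus 1" idea can be sketched in pandas (with made-up data), where `shift(1)` pulls the previous row’s value so you can compare this period against last period:

```python
import pandas as pd

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [100, 120, 90]})

# analogous to Tableau's LOOKUP(expr, -1): grab the previous row's value
df["lookup_minus_1"] = df["sales"].shift(1)
df["vs_last_period"] = df["sales"] - df["lookup_minus_1"]
```

The first row has no previous period, so its lookup value is NaN — the same gap you see in the first row of a Tableau LOOKUP column.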

Read more
Python

Toolset Tuesday: Plotting data from your Pandas Dataframes

In this Toolset Tuesday post, we’re going to look at some simple plotting in Plotly. Why? Because the first step in validating that your Python script has output accurate data is visually sense-checking it. If you know your data, you’ll have some expected outcomes. Here, I have my data in a Pandas dataframe. I’ve then taken […]

Read more
ML

ML Monday: Linear Regression Overview

In the ML Monday series, I aim to go through some of the theory behind machine learning & then move into practical application. Writing machine learning code is relatively trivial, but understanding why we choose a particular model and what those models actually do is really important. Intro to linear regression: linear regression provides a […]
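As a taste of the practical side, here’s an ordinary least squares fit in NumPy on made-up data that roughly follows y = 2x:

```python
import numpy as np

# toy data, roughly y = 2x with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# fit y = a*x + b by ordinary least squares
a, b = np.polyfit(x, y, deg=1)

# predict for an unseen x value
predicted = a * 6 + b
```

The "why" the series digs into is what least squares actually minimises (the sum of squared residuals) and when a straight-line model is the right choice at all.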

Read more
Spark

Starter Sundays: PySpark Basics, Selecting & Filtering Data

Welcome to this week’s Starter Sunday – which is where I cover off some basic concepts of Spark, Python or Hive. Today, we’re looking at selecting and filtering data from our dataframes in Spark – specifically, PySpark. Select specific columns: below, we have a dataframe called df_select which will take just two columns from the dataframe […]

Read more
Spark

Deploying a log monitor with Spark Structured Streaming

In this article, I am going to cover how to create a simple streaming log monitor using Spark Streaming. In subsequent posts, I’ll go deeper on some of these aspects. The first thing we need to do, as always, is to create a Spark Session. There isn’t much new here, except that we import the […]

Read more