ML

Using ROC Curves & AUC

This is a snippet from my upcoming book ‘Data Badass’ (pictured below): The ROC curve and the Area Under the Curve (AUC) are used for binary classification problems. The ROC curve plots the True Positive Rate against the False Positive Rate. Ideally, you want to reduce the number of false positives as much as […]
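
To make the True Positive Rate vs False Positive Rate idea concrete, here is a minimal sketch in plain Python. The function names (`roc_points`, `auc`) and the toy labels and scores are made up for illustration and are not taken from the book. It assumes labels are 0 or 1 and that a higher score means "more likely positive".

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example that shares this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: three positives, three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

An AUC of 1.0 would mean every positive is scored above every negative, while 0.5 is what random scoring gives on average.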

Data

An introduction to structured data modelling

This is an introductory chapter of my upcoming book ‘Data Badass’ (pictured below): Data modelling is all about designing the way that your data is going to be organized. Think about it like building a house; you wouldn’t start laying bricks without a plan. How would you know that the end result would meet your […]

Data

An introduction to data structures for aspiring data engineers

This is an introductory chapter of my upcoming book ‘Data Badass’ (pictured below): Data Types: Data types are ways to classify data, for example integer, string and boolean. These classifications determine what kind of values we can store in our data structures (outlined below). A string is simply a text value. An example […]

Python

Scraping COVID-19 data from websites using Beautiful Soup

This article gives an overview of how to extract data from a website through screen scraping. Specifically, it focuses on the UK government's open data website. Scraping websites is a contentious topic: some websites don’t mind you doing it, while others would really rather you didn’t and put measures in place to try to stop you. […]

Spark

Handling data skew / imbalanced partitions in PySpark

Data skew is a real problem in Spark. It seems like the sort of thing that should be solved automatically by the job manager but, unfortunately, it isn’t. Skew is where a given partition contains far more data than the others, leaving one executor to process a lot of data, […]
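
The definition of skew above can be illustrated with a small sketch in plain Python. This is not Spark's own partitioner: the CRC32-based hashing, the function names (`partition_sizes`, `skew_ratio`) and the sample keys are all assumptions chosen to keep the example self-contained. It only shows how one dominant key pushes most records into a single partition.

```python
import zlib


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under simple hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        # Stand-in for a hash partitioner; CRC32 keeps the result deterministic.
        partition = zlib.crc32(key.encode("utf-8")) % num_partitions
        sizes[partition] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical dataset: one key ("uk") accounts for 90 of 100 records.
keys = ["uk"] * 90 + ["fr", "de", "es", "it", "nl"] * 2
sizes = partition_sizes(keys, num_partitions=4)

print(sizes)
print(skew_ratio(sizes))
```

Because every record with the same key hashes to the same partition, the partition holding "uk" ends up with at least 90 of the 100 records, so whichever executor processes it does most of the work.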

Data, hive

A guide to windowing functions in Hive for data analysis

Windowing functions in Hive are super useful. They make analysis that would otherwise be challenging much easier. Let’s look at examples of the key windowing functions in Hive. We will use the dataset below, the ‘payments’ table, for our analysis. customerid department payment_amount 1 1 10000 2 1 9000 3 2 12000 4 […]

Data, hive

The Hive SQL Crash Course For Data Analysts

SQL is one of the most in-demand data skills. The language has been adopted by many database platforms, including Apache Hive. This article will serve as a crash course in the key functionality of HiveQL. Throughout this article, we will use the two sample tables as the basis for our code. THE CUSTOMERS TABLE […]

Data, ML

The data scientist learning plan for 2021

When you look online at what it takes to become a data scientist, it’s enough to make your brain melt. You have people telling you that you need to be an expert statistician / mathematician; you need to be a top-level coder in 15 different languages; be proficient with every type of SQL/NoSQL database on […]

ML

Can we successfully implement Agile in data science?

Agile is about iterative development and delivering tangible products/features quickly, which provides the business with value and ROI faster than a traditional waterfall project. Consider the example of a piece of accounting software. Overall, it’s going to have 50 features to support the accounts team. To deliver all of the features in a waterfall fashion, […]

ML

The Data Scientist Statistics Learning Plan For 2021

As data scientists, we need to be comfortable with mathematics. If you Google what you need to know, you’ll find answers stating that you need to fully understand linear algebra, calculus, and how to calculate all of the algorithms we use by hand. I’m not going to downplay the importance of understanding how the algorithm works, […]
