Data

The Hive SQL Crash Course For Data Analysts

SQL is one of the most in-demand data skills. The language has been adopted by many database platforms, including Apache Hive. This article will serve as a crash course in the key functionality of HiveQL. Throughout this article, we will use two sample tables as the basis for our code. THE CUSTOMERS TABLE […]

ML

Machine Learning Data Prep: Standardization & Encoding

There are many things to consider when getting your data ready for machine learning models. In this article, I am going to cover encoding and standardizing (or scaling) our data. Standardization / scaling to remove / reduce bias: Scaling is a method used to standardize the range of data. This is important as if […]

Python

Wild Wednesday: handling semi structured JSON data

Wild Wednesday posts are all about taming semi-structured or unstructured data. Today, we’re going to look at ingesting JSON data, generated from YARN, using the API; putting it into a dataframe; and then outputting that information to a Hive table. JSON data can pose problems as it has a flexible schema (i.e. not […]
