Python

Wild Wednesday: handling semi-structured JSON data

Wild Wednesday posts are all about taming semi-structured or unstructured data. Today, we’re going to look at ingesting JSON data generated by YARN via its API, loading it into a dataframe, and then writing that information to a Hive table. JSON data can cause problems because it has a flexible schema (i.e. not […]
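
To give a feel for the flexible-schema problem, here is a minimal sketch in plain Python. The sample records, field names and the `flatten` helper are all made up for illustration; they are not real YARN API output and not necessarily the approach the full post takes. The idea is to flatten nested objects and then line every record up against the union of all keys, so the rows could later be loaded into a dataframe or table.

```python
import json

# Hypothetical sample payload: two records with different fields.
# This is NOT real YARN output, just an illustration of a flexible schema.
raw = '''
[
  {"id": "app_1", "state": "FINISHED", "resources": {"memoryMB": 2048, "vcores": 2}},
  {"id": "app_2", "state": "RUNNING", "queue": "default"}
]
'''


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


records = [flatten(r) for r in json.loads(raw)]

# Build the union of all keys, in first-seen order, so every row shares one schema.
columns = []
for rec in records:
    for key in rec:
        if key not in columns:
            columns.append(key)

# Fields a record does not have become None (a null once loaded into a table).
rows = [[rec.get(col) for col in columns] for rec in records]

print(columns)
for row in rows:
    print(row)
```

Building the column list from the data, rather than hard-coding it, means a new field appearing in later records adds a column instead of breaking the load.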

Spark

Koalas = Pandas simplicity + Spark’s scalability

Whenever you start poking around a dataset to see what you’ve got to play with, you probably immediately write ‘import pandas as pd’. Why? Because Pandas is the gold standard in data analysis libraries; it’s simple yet powerful. The problem is, Pandas just doesn’t scale, so if you’re going to be […]
