Wild Wednesday: handling semi-structured JSON data

Wild Wednesday posts are all about taming semi-structured or unstructured data. Today, we’re going to look at ingesting JSON data generated by YARN via its REST API, loading it into a pandas dataframe, and then outputting that information to a Hive table.

JSON data can pose problems because it has a flexible schema (i.e. not all entries have to have the same attributes; that is, not all rows have to have the same columns). Luckily, pandas deals with that nicely. To start off, we need to import a few modules:
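To see what I mean by a flexible schema, here’s a minimal sketch with a couple of made-up records, one of which is missing an attribute:

```python
import pandas as pd

# Two made-up records with different attribute sets, as you might get from JSON
records = [
    {"id": "app_1", "user": "kodey", "state": "FINISHED"},
    {"id": "app_2", "state": "RUNNING"},  # no "user" key on this one
]

# pandas fills the missing attribute with NaN rather than complaining
df = pd.DataFrame(records)
print(df)
```

The missing `user` value simply becomes NaN, so ragged JSON records still land in a clean tabular shape.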

import json
from datetime import datetime, time
from time import mktime

import pandas as pd
import requests

Now, let’s make the call to the API. Instead of shelling out to a curl command with os.system, we can use the requests library:

r = requests.get('http://yourIP:8088/ws/v1/cluster/apps/?limit=10')

Then, if we got a successful response from our call, we pull the JSON into a pandas dataframe called ‘df’:

if r.status_code == 200:
    df = pd.DataFrame(r.json()['apps']['app'])
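To illustrate the shape we’re unpacking, here’s a sketch with a hard-coded payload mimicking the structure the YARN apps endpoint returns (the field names match the API; the values are made up):

```python
import pandas as pd

# A trimmed, made-up payload in the same nested shape YARN returns:
# a top-level "apps" object wrapping an "app" list
payload = {
    "apps": {
        "app": [
            {"id": "application_1", "state": "FINISHED", "startedTime": 1700000000000},
            {"id": "application_2", "state": "RUNNING", "startedTime": 1700000100000},
        ]
    }
}

# The list under apps -> app is what becomes our dataframe rows
df = pd.DataFrame(payload["apps"]["app"])
print(df.shape)  # (2, 3)
```

Each application record becomes one row, with its attributes as columns.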

Now, I am only interested in pulling the data for jobs that have been running today, because I’ll run this as a daily job for trending purposes. Below, I calculate the unix timestamp for midnight today.

midnight = datetime.combine(datetime.today(), time.min)
# convert midnight to a unix timestamp (in seconds)
starttime = mktime(midnight.timetuple())
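As a quick sanity check on those two lines, converting the timestamp back should land exactly on 00:00:00 today:

```python
from datetime import datetime, time
from time import mktime

midnight = datetime.combine(datetime.today(), time.min)
starttime = mktime(midnight.timetuple())

# Round-tripping the timestamp should land us back on midnight, local time
roundtrip = datetime.fromtimestamp(starttime)
print(roundtrip.hour, roundtrip.minute, roundtrip.second)  # 0 0 0
```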

Then, I filter my dataframe to return only the records where the job’s start time is later than midnight. One catch: YARN reports startedTime in epoch milliseconds, while mktime gives us seconds, so we scale the cutoff before comparing.

df = df.loc[df['startedTime'] > starttime * 1000]
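Here’s a small sketch of that filter on made-up data, assuming startedTime is in epoch milliseconds (as the YARN API reports it), which is why the seconds-based cutoff needs scaling:

```python
import pandas as pd

# Made-up jobs: one started before the cutoff, one after
df = pd.DataFrame({
    "id": ["application_1", "application_2"],
    "startedTime": [1_700_000_000_000, 1_700_100_000_000],  # epoch milliseconds
})

cutoff_seconds = 1_700_050_000  # a hypothetical midnight, in seconds

# startedTime is in milliseconds, so scale the cutoff before comparing
df = df.loc[df["startedTime"] > cutoff_seconds * 1000]
print(list(df["id"]))  # ['application_2']
```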

Finally, I export the output as a pipe-delimited CSV file, because some of our fields contain commas, which would misalign the columns in a standard comma-separated file.

df.to_csv('yarn_output.csv', sep='|', index=False)
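To see why the pipe separator matters, here’s a round-trip sketch with a made-up field containing a comma:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["application_1"],
    "name": ["my job, with a comma"],  # a comma that would break a plain CSV
})

# Pipe-delimited output keeps the comma safely inside its column
df.to_csv("yarn_output.csv", sep="|", index=False)

# Reading it back with the same separator returns the field intact
df2 = pd.read_csv("yarn_output.csv", sep="|")
print(df2.loc[0, "name"])  # my job, with a comma
```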

We can then use the hdfs dfs -put command to move our CSV into the directory backing our Hive table.

hdfs dfs -put yarn_output.csv /user/hive/warehouse/your.db/your.table/yarn_output.csv

So there we have it: we’ve taken our JSON data and put it into a structured data source, ready for analysis.

Kodey