Wild Wednesday posts are all about taming semi or unstructured data. Today, we’re going to look at ingesting JSON data, generated from YARN, using the API; putting it into a dataframe and then outputting that information to a Hive table.
JSON data can pose us with problems as it has a flexible schema (i.e. not all entries have to have the same attributes – that is, not all rows have to have all the same columns. Luckily, Pandas deals with that nicely. To start off, we need to import a few modules:
import pandas as pd
from datetime import datetime, time
from time import mktime
Now, let’s make the call to the API. Instead of using a curl command via the os.system module, we can use the requests library:
We then say, if we got a successful response from our call, pull the JSON into a pandas dataframe called ‘df’.
if r.status_code == 200:
df = pd.DataFrame(r.json()[‘apps’][‘app’])
Now, I am only interested in pulling the data for jobs that have been running today, because I’ll run this as a daily job, for trending purposes. In the below, I calculate what the unix timestamp for midnight today was.
midnight = datetime.combine(datetime.today(), time.min)
#convert time to unix timestamp
starttime = mktime(midnight.timetuple())
The, I filter my dataframe to return only the records where the start time of the job was greater than midnight.
df = df.loc[df[‘startedTime’] > starttime]
Finally, I export our output to CSV as a pipe delimited file, because some of our fields include commas, which would lead to misalignment.
df.to_csv(‘yarn_output.csv’, sep=’|’, index=[‘id’])
We can then use the HDFS Put command to move our CSV to the directory for our Hive table.
hdfs dfs -put yarn_output.csv /user/hive/warehouse/your.db/your.table/yarn_output.csv
So there we have it, we’ve taken our JSON data and put it into a structured data source ready for analysis.