ML Series: Predicting Diabetes; Exploring Data

Here we have a dataset from Kaggle; every record in it describes a female patient aged 21 or above. The task is to take that data and build a classification model which will determine, given some information about a patient, whether they are diabetic.

In this post, we’re going to ingest and explore the dataset; in the next post, we’ll actually build the model and start making predictions.

Importing the data is as easy as it gets:

import pandas as pd
df = pd.read_csv('diabetes2.csv')

Now, the interesting bit. Let’s get exploring. The first thing which I like to do is check whether we have any nulls in the dataset:

df.isnull().sum()

And as you can see from the output, no, we don’t, which of course means that we don’t have to do a whole lot of data cleaning.

Now, I want to see how the data looks. I could do a lot of individual charts to achieve this, but right now, I just want to see everything at a high level & start looking at relationships between different features. So, we’re going to use a pairplot; in the code below, I also save that plot to a file.
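One caveat: a null count only catches values that pandas recognises as missing. Some datasets record a missing measurement as 0 instead, which isnull() will not flag. As a minimal sketch, assuming that a zero in columns like Glucose, BloodPressure or BMI is a placeholder rather than a real reading, we can count zeros per column. The toy frame below is made up purely so the snippet runs on its own; swap in the real df.

```python
import pandas as pd

# Made-up stand-in rows so the snippet runs on its own; use the real df instead.
toy = pd.DataFrame({
    'Glucose':       [148, 85, 0, 89],
    'BloodPressure': [72, 66, 64, 0],
    'BMI':           [33.6, 26.6, 0.0, 28.1],
})

# Assumption: a zero in these columns is a placeholder, not a real measurement.
suspect_cols = ['Glucose', 'BloodPressure', 'BMI']

# Compare every cell to 0, then sum the True values per column.
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```

If the counts on the real data turn out to be non-trivial, that is worth bearing in mind when reading the charts that follow.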

import seaborn as sns
x=sns.pairplot(df)
x.savefig("output.png")

The output is shown below. This is super useful as we can now see the relationship between all of the fields. We have all the fields on both axes. So, in the top left, we have pregnancies vs pregnancies; so of course, we just see the distribution of that single field.

As we move across to the right, we have pregnancies vs glucose and we could say that there is a slight correlation between the number of pregnancies and glucose levels, as the chart slants ever so slightly upwards to the right, but it’s nothing to write home about; it’s negligible.

We can keep looking through the variables & we can see that there are other correlations but none are particularly strong and none are particularly exciting.

The above takes quite a lot of brainpower, which is bad. We can very subtly alter our code like below & it’s going to fit a trend line to the data for us. This makes it much easier for us to see the correlation between any two features. The kind="reg" part of the code means ‘kind = regression’; hence, the regression line. The markers="." part draws each point as a small dot, so the line isn’t buried under the data.

x=sns.pairplot(df, kind="reg",markers=".")
x.savefig("output.png")

We can go even further and start relating those datapoints to the outcome (i.e. whether they had diabetes or not) by passing the Outcome column as the hue. Here, we can see clearly that as glucose goes up, so does the risk of diabetes. I would however still make the claim that there aren’t any massively strong correlations.

x=sns.pairplot(df, kind="reg",markers=".", hue = 'Outcome')
x.savefig("output.png")
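To put a number on what the coloured plot suggests, one option is to compare the average of each feature between the two outcome groups. This is only a sketch on a few made-up rows (the values are not from the dataset), and it assumes Outcome is coded 0 for non-diabetic and 1 for diabetic.

```python
import pandas as pd

# Made-up rows for illustration only; replace toy with the real df.
toy = pd.DataFrame({
    'Glucose': [90, 100, 110, 150, 160, 170],
    'Age':     [25, 30, 35, 40, 45, 50],
    'Outcome': [0, 0, 0, 1, 1, 1],  # assumed coding: 0 = no diabetes, 1 = diabetes
})

# One row per outcome group, one column per feature, each cell a group mean.
group_means = toy.groupby('Outcome').mean()
print(group_means)
```

A big gap between the two rows for a feature is a hint that the feature separates the groups; a small gap suggests it doesn’t do much on its own.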

We can validate that with a correlation heatmap. As you can see below, there is indeed a very slight positive correlation between pregnancies and glucose levels, as we discussed above, and we can confirm that there aren’t any massively strong correlations except those which are expected (e.g. age vs pregnancies).

sns.heatmap(df.corr(), annot=True)
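The heatmap is easy to scan by eye, but the same numbers can also be ranked directly. Here is a rough sketch that lists each pair of fields once, strongest correlation first. The toy values are invented so the snippet runs standalone; with the real data you would start from df.corr() instead.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only; replace toy with the real df.
toy = pd.DataFrame({
    'Age':         [21, 25, 30, 35, 40, 45],
    'Pregnancies': [0, 1, 2, 4, 5, 7],
    'Glucose':     [120, 95, 140, 100, 130, 110],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort by absolute strength so strong negative correlations rank highly too.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```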

If I want to deep dive on that stronger correlation, I can do so. As you can see below, as age goes up there is a clear upward trend in pregnancies until we hit a certain age, at which point it drops off. Across all ages there is quite a high spread of data.

import matplotlib.pyplot as plt
plt.subplots(figsize=(20,15))
sns.boxplot(x='Age', y='Pregnancies', data=df)
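With one box per individual age, a plot like this can get crowded. One alternative, sketched here on made-up values, is to bin ages into bands with pd.cut and summarise pregnancies per band. The band edges and labels are arbitrary choices for illustration, not something taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; replace toy with the real df.
toy = pd.DataFrame({
    'Age':         [22, 27, 33, 38, 44, 49, 58, 63],
    'Pregnancies': [0, 1, 3, 4, 6, 8, 5, 4],
})

# Hypothetical band edges; right=False makes each band include its lower edge.
bins = [21, 30, 40, 50, 70]
labels = ['21-29', '30-39', '40-49', '50-69']
toy['AgeBand'] = pd.cut(toy['Age'], bins=bins, labels=labels, right=False)

# Median, min and max pregnancies per age band.
band_summary = (
    toy.groupby('AgeBand', observed=True)['Pregnancies']
       .agg(['median', 'min', 'max'])
)
print(band_summary)
```

The same AgeBand column could be passed as x to sns.boxplot to get a handful of wide boxes rather than dozens of narrow ones.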

As a final task here: it’s not data exploration, but it’s a preparatory step for our model. We’ll normalize all of the feature fields in our dataset by dividing each one by its maximum value, so we don’t end up with a massive disparity in values between fields. Below is a manual way to do this; there are libraries you can use to make life easier.

# Divide each feature column by its maximum so the largest value becomes 1.
# Outcome is the label, so it is left untouched.
feature_cols = ['Glucose', 'Pregnancies', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
for col in feature_cols:
    df[col] = df[col] / df[col].max()
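Dividing by the maximum rescales each column so that its largest value is 1. Standardization in the stricter sense (zero mean, unit standard deviation) is a different transform, and it can be handy to see the two side by side. The sketch below uses plain pandas on invented numbers; on the library side, scikit-learn’s MaxAbsScaler and StandardScaler are, as far as I know, the closest counterparts, though they aren’t used here.

```python
import pandas as pd

# Invented values for illustration only; replace toy with the real feature columns.
toy = pd.DataFrame({
    'Glucose': [80.0, 100.0, 120.0, 200.0],
    'Age':     [21.0, 30.0, 45.0, 60.0],
})

# Max scaling, as in the post: the largest value in each column becomes 1.
max_scaled = toy / toy.max()

# Z-score standardization: each column ends up with mean 0 and std 1.
# ddof=0 uses the population standard deviation (pandas defaults to ddof=1).
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(max_scaled)
print(standardized)
```

Which one suits the model better depends on the algorithm; that is a decision for the next post, so treat this as a comparison rather than a recommendation.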

Okay, so we’ve explored our data; we have a good idea of its shape, distribution etc. Luckily for us, the data was pretty clean, so we didn’t need to do a whole bunch of cleaning up. In the next article, we’ll start building a model.