Following on from my previous article, we are continuing our journey towards predicting whether someone has diabetes.
The way we do this is pretty straightforward. First, we get our data ready to be ingested by the model. To do this, we create two new variables, y and X: y stores the outcomes (the truth), while X stores the features that are passed into the model to predict that outcome.
We then use an out-of-the-box function, train_test_split, to split our data into two chunks: a training dataset and a testing dataset. We set the test size to 20%, leaving 80% of our data to train the model.
from sklearn.model_selection import train_test_split
y = df["Outcome"]
X = df.drop(["Outcome"], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 7)
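If you want to sanity-check the split, you can print the shapes of the resulting sets. This is just an optional check; the test set here works out to 154 rows, which is what the confusion matrix later in this article adds up to.
# Optional: confirm the 80/20 split produced the expected number of rows.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)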
Now, we create a new model and fit our training data to it (i.e. we give the model all of the features and the correct outputs and let it find patterns in the data). We then use the model's built-in score method to score it; for a classifier this reports accuracy. The initial output is 76.62%.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver = 'liblinear')
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
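As a quick aside, the score above is simply the accuracy on the test set, so you should be able to reproduce it by generating predictions yourself and comparing them to the known outcomes. A minimal sketch using scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
# Predict outcomes for the held-out test set and compare them to the known truth.
# This should print the same value as model.score(X_test, y_test).
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))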
Confusion Matrix
If we want to understand a little more about the model score & the types of errors being made, we can use a confusion matrix. What do I mean by that? Well, we have 4 fields in a confusion matrix:
- True Positives (top left): where the model predicted yes & the model was correct
- True Negatives (bottom right): where the model predicted no and the model was correct
- False Positives (top right): where the model predicted yes and the model was incorrect
- False Negatives (bottom left): where the model predicted no and the model was incorrect
You may ask, ‘we already know that the model is 76% accurate, so why do we care?’ Well, this information is useful because it tells us what the model is getting wrong, and we can compare that against other models and choose the right one.
For example, if we have a logistic regression model with the below confusion matrix, we can see that it is making a lot of false positives (31 people are flagged as having diabetes when they do not):
| | Actually Has Diabetes | Actually Does Not Have Diabetes |
| --- | --- | --- |
| Predicted: Has Diabetes | 92 | 31 |
| Predicted: Does Not Have Diabetes | 5 | 26 |
The below random forest, built on the same data, has the same number of incorrect predictions (36). But in this case, it is more likely to predict that someone does not have diabetes when, in fact, they do.
| | Actually Has Diabetes | Actually Does Not Have Diabetes |
| --- | --- | --- |
| Predicted: Has Diabetes | 92 | 5 |
| Predicted: Does Not Have Diabetes | 31 | 26 |
As data scientists, we need to decide which is preferable. Is it better to flag more people as being at risk of diabetes when they are not, or vice versa?
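If you would like to reproduce that kind of side-by-side comparison yourself, a minimal sketch is below (we cover building a confusion matrix with scikit-learn in a moment). It assumes the same train/test split as above; a random forest is just one alternative model you might try, and your exact numbers will vary with the data and random state, so don't expect them to match the illustrative tables above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
# Train an alternative model on exactly the same training data as the logistic regression.
rf_model = RandomForestClassifier(random_state=7)
rf_model.fit(X_train, y_train)
# Print both models' confusion matrices so the error types can be compared.
print(confusion_matrix(y_test, model.predict(X_test)))     # logistic regression
print(confusion_matrix(y_test, rf_model.predict(X_test)))  # random forest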
Right, so implementing a confusion matrix is simple. We run predictions against the X_test dataset. We then use the confusion_matrix function from sklearn.metrics to compare y_test (the actual outcomes) against the predictions we just made. You can see that the output is an array.
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[92, 5],
[31, 26]])
That array can be written as the table below. One thing to be careful of: with scikit-learn, the rows correspond to the actual (known) outcomes and the columns correspond to the predicted outcomes, and the classes appear in sorted order, so ‘does not have diabetes’ (Outcome = 0) comes first. That layout is flipped relative to the illustrative tables above, where ‘has diabetes’ was listed first. Reading the array this way: 92 people were correctly predicted as not having diabetes (true negatives), 5 were incorrectly flagged as having diabetes (false positives), 31 people who do have diabetes were missed (false negatives), and 26 were correctly predicted as having diabetes (true positives). Note: the size of a confusion matrix is determined by the number of classes we are trying to predict. If we had Has Diabetes, Does Not Have Diabetes and Might Have Diabetes, we would have an additional column and an additional row in our matrix.
| | Predicted: Does Not Have Diabetes | Predicted: Has Diabetes |
| --- | --- | --- |
| Actually Does Not Have Diabetes | 92 | 5 |
| Actually Has Diabetes | 31 | 26 |
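If you would rather not read the cells out of the array by position, one common approach (a sketch continuing from the code above, assuming the two-class case and scikit-learn's sorted label order described earlier) is to unpack the matrix with ravel():
# For a binary problem, ravel() flattens the 2x2 matrix in row order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)  # 92, 5, 31, 26 for the matrix above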
In my next article, I’ll be speaking about metrics we can use to evaluate our model’s accuracy.