Validating the error estimates of our models with Cross Validation

Let’s say we have the below dataset. It’s a list of patients, with some health KPIs and a boolean field that says whether they have diabetes or not. If you had thousands of patients, you would probably find some trends in this data that would allow you to predict whether someone had diabetes, based on their health KPIs, without needing to actually administer a test.

| Weight | Blood Sugar | Blood Pressure | Diabetes |
| --- | --- | --- | --- |
| 95 | 20 | 145 | Yes |

The idea is that we train our model using the above data, and when a new patient record arrives (like the one below) we can use that model to predict whether they have diabetes; a small code sketch of this follows the table.

| Weight | Blood Sugar | Blood Pressure | Diabetes |
| --- | --- | --- | --- |
| 76 | 4 | 101 | ??? |
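
As a rough sketch of that idea, assuming scikit-learn is available: fit a classifier on the patients we already have labels for, then call predict on a new, unlabeled record. The data here is synthetic and just stands in for the patient table, so the numbers are illustrative rather than real patient values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the labeled patient table:
# 3 features (think weight, blood sugar, blood pressure) and a yes/no label.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Train on the patients we already have labels for...
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# ...then predict for a new, unlabeled record (here the first row is reused
# as a stand-in for the incoming patient).
print(model.predict(X[:1]))
```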

When we test our model, it’s quite common to split our data into a training set and a test set (say, 75% for training and 25% for testing). But how do we know that’s a good way to divide the data? Will it capture enough of the nuances of the data to produce an accurate model? The truth is, we don’t know.
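
For reference, a minimal version of that kind of split with scikit-learn’s train_test_split might look like the sketch below (again on synthetic stand-in data rather than the real patient table).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient data: 3 features, yes/no label.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Hold back 25% of the rows for testing and train on the remaining 75%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (750, 3) (250, 3)
```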

Cross validation splits our data into k blocks (folds). In four-fold cross validation, the data is split into four chunks, and the model is trained and evaluated four times across the whole dataset, changing which chunk is used for testing and which are used for training on each run (like below).

| CV | Block 1 | Block 2 | Block 3 | Block 4 | Outcome |
| --- | --- | --- | --- | --- | --- |
| 4 (run 1) | Train | Train | Train | Test | 70% |
| 4 (run 2) | Train | Train | Test | Train | 71% |
| 4 (run 3) | Train | Test | Train | Train | 68% |
| 4 (run 4) | Test | Train | Train | Train | 73% |

The overall accuracy estimate is simply the average of the per-run outcomes (70.5% for the table above). Ten-fold cross validation is the most common choice, but four folds were simpler to demonstrate in the table.
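
To make the rotation in the table concrete, here is a sketch of four-fold cross validation written out by hand with scikit-learn’s KFold. The dataset and model are placeholders rather than the actual diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the patient data: 3 features, yes/no label.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Four-fold cross validation: each run trains on three blocks and tests on the fourth.
kf = KFold(n_splits=4, shuffle=True, random_state=42)
scores = []
for run, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    scores.append(acc)
    print(f"run {run}: accuracy {acc:.2f}")

# The overall estimate is the average of the per-run outcomes.
print("mean accuracy:", sum(scores) / len(scores))
```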

Now, imagine that with this diabetes problem you didn’t know whether to use a logistic regression model, a k-nearest neighbors model, or a random forest model. If you were to run cross validation on each model, you’d be able to determine which gives you the best overall accuracy. So cross validation can also help you decide which model to go with.

Implementing cross validation in Python is pretty simple, as shown below.
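
One straightforward way is scikit-learn’s cross_val_score, which runs the whole train/test rotation for you. As a rough sketch (the data here is synthetic and stands in for the patient table), it also makes it easy to compare the three candidate models from above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the patient data: 3 features, yes/no label.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

candidates = {
    "logistic regression": LogisticRegression(),
    "k-nearest neighbors": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(random_state=42),
}

# 10-fold cross validation on each candidate; the mean score is the
# accuracy estimate we compare across models.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Each call returns one accuracy score per fold, and comparing the means (and spreads) across the candidates tells you which model generalizes best on this data.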