In this article, I am going to walk through a really simple machine learning scenario, using data from Kaggle. In future articles, I will look at tuning this model to improve the accuracy and see how different algorithms produce differing results.
The purpose of this article is to introduce a machine learning model using as few parameters as possible.
The first thing, then, is to bring our data into a dataframe. That’s simple, as it’s a CSV file, which I read in using the pandas library.
In this case, I know the data is clean, so we don’t need to get bogged down handling null values or doing any data cleansing – which is perfect for this introductory article. To get started, let’s import some libraries.
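As a rough sketch of this step, the snippet below reads a CSV into a dataframe and confirms there are no nulls. The rows and the column names other than quality are made up so that it runs without the Kaggle file; with the real data you would pass the file path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV: a few invented rows.
# Only the "quality" column name comes from the article; the rest are assumptions.
csv_text = """alcohol,pH,quality
9.4,3.51,5
9.8,3.20,5
10.5,3.30,6
11.2,3.16,7
"""

# With the real file this would be pd.read_csv("<path to the CSV>").
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                  # (rows, columns)
print(df.isnull().sum().sum())   # total number of missing values
```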
Now that we have imported the libraries, let’s get to work. We need to split our data to separate the target column from the rest of the dataset. Why? Because if a model ingests the target column as one of its features, you’re giving it a feature with 100% correlation to the expected output. Model accuracy is then likely to be 100%, but that number tells you nothing about how well the model really predicts.
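One way to separate the target from the features might look like this. The dataframe is a tiny invented example; only the quality column name comes from this article, and the variable names features and labels are my own choice.

```python
import pandas as pd

# Tiny invented dataframe; only "quality" is a column name taken from the article.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.30, 3.16],
    "quality": [5, 5, 6, 7],
})

# Everything except the target becomes the feature set...
features = df.drop(columns=["quality"])
# ...and the target column on its own becomes the labels.
labels = df["quality"]

print(list(features.columns))
print(labels.tolist())
```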
Next, as I discussed here, we need to scale the data. Scaling is a method used to standardise the range of data. This is important because if one field stores age (between 18 and 90) and another stores salary (between 10,000 and 200,000), the machine learning algorithm might bias its results towards the larger numbers, as it may assume they’re more important. The scikit-learn documentation states that “If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.”
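To make the age and salary example concrete, here is a minimal sketch using scikit-learn's StandardScaler. That choice of scaler is an assumption on my part (other scalers, such as MinMaxScaler, would work in a similar way), and the numbers are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented rows: column 0 is age, column 1 is salary.
X = np.array([
    [18, 10_000],
    [35, 60_000],
    [60, 120_000],
    [90, 200_000],
], dtype=float)

# StandardScaler rescales each column to zero mean and unit variance,
# so salary no longer dwarfs age purely because of its units.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```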
Next, we convert our dataframes into NumPy arrays to be passed into the train/test split function. This splits the dataset, so we have 75% for training our model and the remaining 25% to test its accuracy. Random state simply means that the split of the data will be the same every time we run the model. So if we run it 100 times, we will always get the same data split (and, if the model is seeded too, the same output), which is really helpful when validating your tuning efforts.
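A sketch of the 75/25 split, assuming scikit-learn's train_test_split. The arrays are synthetic and the random_state value of 42 is arbitrary; the second call is only there to show that the same seed gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the scaled features and the labels (20 rows).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.25 keeps 75% of the rows for training and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Same seed, same split: repeating the call returns identical rows.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test_again))
```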
We then simply define the type of model we want to use which in this case is a random forest classifier. Here, we have 50 estimators (i.e. 50 decision trees in the forest).
We then fit the model with our training data set so it can learn the patterns and build its estimation algorithm. After that, we run a prediction with the test features (the 25% of our dataset).
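Defining, fitting and predicting could be sketched as below. The data comes from scikit-learn's make_classification rather than the Kaggle file, and seeding the model with random_state is my addition so the run is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for the real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A random forest made up of 50 decision trees.
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Learn from the training rows, then predict labels for the unseen test rows.
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(len(model.estimators_))  # number of fitted trees
print(predictions[:10])
```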
Finally, we check the accuracy. First, we pass in the model score using our training features and training labels – of course this is 100% accuracy, as the model has already seen that dataset & knows the outcome.
The test dataset produces a model accuracy of 0.68 (68%).
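The two accuracy checks might look like the following sketch. It again uses synthetic data, so the numbers it prints will not match the 68% above; the point is only the difference between scoring on data the model has seen and data it has not.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data and the same model setup as described in the article.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Accuracy on rows the model was trained on (expected to be very high).
train_accuracy = model.score(X_train, y_train)
# Accuracy on the held-out 25%, which is the figure that matters.
test_accuracy = model.score(X_test, y_test)

# model.score on a classifier is the same as accuracy_score on its predictions.
same_as_score = accuracy_score(y_test, model.predict(X_test))

print(train_accuracy, test_accuracy)
```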
So what about tuning?
It may be the case that we can squeeze a higher level of accuracy out of our model, and we’ll go through a number of tests to do this in future articles. Something always worth checking is the correlation of the dataset. Sometimes a few variables are highly correlated with the outcome, in which case you could reduce the number of features and sometimes improve accuracy.
As you can see from the heatmap below, there are no features with a strong correlation with the quality field (the outcome).
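A correlation check along these lines can be sketched with pandas alone. The data here is random and the column names other than quality are hypothetical, so the values mean nothing in themselves; the matrix is what a heatmap would be drawn from.

```python
import numpy as np
import pandas as pd

# Random stand-in data; only the "quality" column name comes from the article.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "alcohol": rng.normal(10, 1, 100),
    "pH": rng.normal(3.3, 0.2, 100),
})
df["quality"] = rng.integers(3, 9, 100)

# Pairwise Pearson correlation between every pair of columns.
corr = df.corr()

# Correlation of each feature with the outcome, strongest first.
quality_corr = corr["quality"].drop("quality").abs().sort_values(ascending=False)

print(quality_corr)
```

The corr matrix can then be handed to a plotting function, for example seaborn's heatmap, to get a picture like the one described above.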
Stay tuned to Kodey for how to approach the model tuning task.