Having a date in our dataset can really improve our predictions. Imagine an ice cream company: their data has a seasonal aspect (more sales in summer), so predicting sales without dates in our data would be inaccurate.
The problem, though, is that dates are horrible to work with. The first of June 2020 could be displayed as any of the below, and these are just a few examples!
- 01-06-2020 08:00:22.11
Most machine learning models can’t use raw dates as features: to the model, a date is just a meaningless string. So we need to split those dates out into numeric parts.
To do that, I am going to ingest some mock data. Here, we have some sample invoicing data where the date is in mm/dd/yyyy format. We ingest that using Pandas.
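As a minimal sketch of the ingestion step, the snippet below reads a few made-up invoice rows with Pandas. The column names (`invoice_id`, `invoice_date`, `amount`) and the values are assumptions for illustration only, and the data is read from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the invoicing file;
# dates are written in mm/dd/yyyy format.
csv_text = """invoice_id,invoice_date,amount
1001,06/01/2020,250.00
1002,06/15/2020,99.50
1003,12/24/2020,410.25
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# At this point invoice_date is still plain text, not a real date.
print(df.dtypes)
```

With a real dataset you would pass the file path to `pd.read_csv` instead of the `StringIO` object.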
Next, let’s convert this to a date format. Here, we’re taking the invoice date and using the to_datetime function. We then tell Pandas the dates are in mm/dd/yyyy format by passing the format string %m/%d/%Y.
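The conversion might look something like this. It is a sketch that assumes a hypothetical column called `invoice_date` and rebuilds a small sample frame so it can run on its own.

```python
import pandas as pd

# Hypothetical sample data; dates are strings in mm/dd/yyyy format.
df = pd.DataFrame(
    {
        "invoice_date": ["06/01/2020", "06/15/2020", "12/24/2020"],
        "amount": [250.00, 99.50, 410.25],
    }
)

# Parse the strings into real datetimes.
# %m = month, %d = day, %Y = four-digit year.
df["invoice_date"] = pd.to_datetime(df["invoice_date"], format="%m/%d/%Y")

print(df.dtypes)
```

Spelling out the format avoids ambiguity: without it, a value like 01/06/2020 could be read as either the 6th of January or the 1st of June.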
We can then use the date functions to extract day, month and year from the date into new columns.
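One way to do this is with the `.dt` accessor that Pandas exposes on datetime columns. Again, the column names below (`invoice_date`, `day`, `month`, `year`) are assumptions for the sake of the example.

```python
import pandas as pd

# Hypothetical sample data, already parsed into datetimes.
df = pd.DataFrame(
    {
        "invoice_date": pd.to_datetime(
            ["06/01/2020", "06/15/2020", "12/24/2020"], format="%m/%d/%Y"
        ),
        "amount": [250.00, 99.50, 410.25],
    }
)

# The .dt accessor exposes the date parts as integer columns.
df["day"] = df["invoice_date"].dt.day
df["month"] = df["invoice_date"].dt.month
df["year"] = df["invoice_date"].dt.year

print(df)
```

Each new column is a plain number, which is something a model can actually work with.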
You can then drop the original date column, so you don’t pass it in as a feature to your model. In the drop call, the axis argument is set to 1 (1 for columns and 0 for rows).
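A short sketch of the drop, once more using the hypothetical `invoice_date` column and a small self-contained sample frame:

```python
import pandas as pd

# Hypothetical sample data with the date parts already extracted.
df = pd.DataFrame(
    {
        "invoice_date": pd.to_datetime(
            ["06/01/2020", "06/15/2020", "12/24/2020"], format="%m/%d/%Y"
        ),
        "amount": [250.00, 99.50, 410.25],
    }
)
df["day"] = df["invoice_date"].dt.day
df["month"] = df["invoice_date"].dt.month
df["year"] = df["invoice_date"].dt.year

# axis=1 means "drop a column"; axis=0 would look for a row label instead.
df = df.drop("invoice_date", axis=1)

print(df.columns.tolist())
```

If you prefer something more explicit, `df.drop(columns=["invoice_date"])` does the same thing without the axis number.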
So there we go. Handling dates is absolutely critical for our machine learning models but is really quite simple. Hopefully this article has helped you to understand that 1) raw dates don’t belong in an ML algorithm and 2) it’s really easy to make features from our dates that do matter.