ML Series: The Basics of Timeseries Analysis – Stationarity and Autocorrelation

The purpose of timeseries analysis is to analyse historic timeseries data and extrapolate the patterns we identify into the future, enabling us to make predictions.

It’s important to understand when we can use timeseries analysis. If the timeseries has a clear pattern – a trend and seasonality – then we can model an accurate forecast. If a dataset has a great deal of randomness and high variance, then we will be unable to forecast accurately – hence why nobody has stock price forecasting down to an art yet; it’s just too random.

Of course, when we come to make a forecast, we need to consider whether we have enough historic data to work from and whether there is a clear pattern. Importantly, we also need to consider whether the pattern in the data changes over time – if it does, we can use exponential smoothing, which puts more weight on more recent dates to take current patterns into account.

When we’re working on a timeseries project, we need to be pragmatic about the kind of forecasting we can achieve. The further into the future the forecast goes, the less accurate it becomes; so we don’t want to overburden ourselves by trying to generate a 5 year forecast.

Okay, so that’s the intro done. Phew!

The data:

When we consider our dataset, we have two things to consider:

  1. Is it a regular or irregular timeseries? A regular timeseries has a specific interval – e.g. we may get one observation per day. An irregular timeseries is more sporadic and we get observations every so often. It’s much harder to forecast an irregular timeseries.
  2. Are we conducting univariate or multivariate analysis? In other words, how many columns are we passing into the model to make the prediction? If we simply have DATE & VALUE, we have univariate analysis.

Stationarity:

Okay so this is about to get a whole bunch of confusing, but hang in there.

We have a concept called stationarity – we need a dataset to be stationary to be able to run accurate predictions on it. A stationary dataset has the same statistical properties throughout the timeseries – the variance, mean and autocorrelation (discussed below) will be the same, whatever window of time you choose.

Now, if we think about the below dataset, it’s already stationary – we can tell by looking at it that the variance is consistent throughout the series, there is no upward or downward trend and there is no seasonality.


Whereas the below timeseries data clearly does not possess any of those attributes. There is an incredibly strong trend, and the mean is clearly not the same in a window spanning 0 to 4 as it is in one spanning 6 to 10.


We can handle shifts in the mean quite nicely. We use a concept called differencing, which, rather than using the raw values, simply takes the difference between Y and Y-1 (i.e. the difference between the current data point and the previous one). With seasonal data, we instead take the difference against the same point in the previous season, so we are comparing the change from this year to last year.
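Both kinds of differencing are one-liners in pandas (the series and the seasonal period of 12 below are just illustrative assumptions for monthly data):

```python
import pandas as pd

# A trending series: the raw values drift upward
y = pd.Series([10, 12, 15, 19, 24, 30])

# First-order differencing: Y minus Y-1 removes the trend in the level
first_diff = y.diff()  # first value is NaN (no previous point)

# Seasonal differencing: compare each point to the same point last season
# (periods=12 assumes monthly data with yearly seasonality)
monthly = pd.Series(range(24))
seasonal_diff = monthly.diff(periods=12)

print(first_diff.tolist())
print(seasonal_diff.dropna().tolist())
```

Note that the raw values of `y` climb steadily, while the differenced values are far more stable – that’s the whole point of the transformation.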

We can handle shifts in variance using ARCH/GARCH models. However, this is not always necessary: most models only model the mean, so we don’t always have to worry about the variance.
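Fitting ARCH/GARCH needs a dedicated package, but before reaching for one, a rolling standard deviation is a cheap way to eyeball whether the variance is drifting at all (the window size of 50 and the toy series here are my own choices, not from the post):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Constant-variance noise vs noise whose spread grows over time
stable = pd.Series(rng.normal(0, 1, 200))
exploding = pd.Series(rng.normal(0, 1, 200) * np.linspace(1, 5, 200))

# Rolling std over a 50-point window; a roughly flat line suggests
# the variance is stable, a climbing line suggests it is not
print(stable.rolling(50).std().dropna().agg(["min", "max"]))
print(exploding.rolling(50).std().dropna().agg(["min", "max"]))
```

For the stable series the min and max of the rolling std stay close together; for the second series the spread widens several-fold, which is the signature that a variance model might be worth the effort.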

Okay, so there is of course a statistical way to determine whether we have a stationary dataset or not. We use the (Augmented) Dickey-Fuller Test, whose null hypothesis is that the series is non-stationary. If the p-value it returns is less than 0.05, we reject that null hypothesis and treat the dataset as stationary.

from statsmodels.tsa.stattools import adfuller
x = [0, 1, 3, 5, 7, 9, 6, 60, 77, 65, 77]
df_test = adfuller(x, autolag="AIC")

We get the below output. The important thing to note here is the second item within the tuple: 0.99. This is the p-value and, as you can see, it’s not below 0.05, hence we do not have a stationary dataset. No surprise, as this was the graph above with the aggressive trend.

(1.976194812493461,
 0.9986412088826317,
 3,
 7,
 {'1%': -4.9386902332361515,
  '5%': -3.477582857142857,
  '10%': -2.8438679591836733},
 62.42045333867975)

If we run the same on the first dataset:

import numpy as np
x = np.random.normal(1, 3, 300)  # mean of 1, std dev of 3, 300 datapoints
df_test = adfuller(x, autolag="AIC")

We get the below output. The p-value this time is 5.36e-30 – written out of scientific format, that’s a decimal point followed by 29 zeroes before the first significant digit. This is most certainly below 0.05.

(-17.33804743990615,
 5.36056741894263e-30,
 0,
 299,
 {'1%': -3.4524113009049935,
  '5%': -2.8712554127251764,
  '10%': -2.571946570731871},
 1436.4315122401867)

Autocorrelation

Autocorrelation is another important term to get our heads around. Essentially, here we are correlating our data against itself, with some amount of shift, to determine whether early observations influence later observations. Trend and seasonality are examples of autocorrelation. With an upward trend, there is a good chance that T is higher than T-1, and the weather in July 2019 will be a good indicator of the weather in July 2020.

The autocorrelation function (ACF) looks at T and compares it to T-1; then compares T to T-2, and so on. The idea is to understand how correlated T is with each previous point in the dataset – did any of those points influence T? This approach means we will pick up seasonal trends, where July last year is very indicative of the weather in July this year.

If we look at an ACF chart and see it tailing off towards the end (smaller spikes), it indicates that the further away you get from the original datapoint, the weaker the correlation between the data. If you imagine comparing January 01 to January 10, you will probably find a strong link; if you compare July 15 to January 01, you won’t find any correlation there.

If we look at an example below, we can see some clear seasonality – it does seem that historic datapoints can influence future ones, and they seem to be spaced seasonally. Note: anything outside of the blue area is statistically significant.

Kodey