ML Series: Dealing with outliers in Timeseries Analysis to reveal trends and patterns

Welcome back to the series on timeseries analysis. In this article, we’re going to discuss plotting timeseries data and smoothing it to handle outliers and make finding a trend a little bit easier.

In the code below, I have ingested the timeseries data into a dataframe called df, indexed by the year column. From that dataframe, I have then built a series with my own index, which has an annual frequency (one datapoint every December). Both the original dataframe and timeseries_df are therefore indexed by date.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#import data using year column as index.
df = pd.read_csv("timeseriesdata.csv", header = 0,
                     names = ['year', 'sales'],
                     index_col = 0)

#convert to series
series_df = df['sales']

#generate our own timeseries index if we wish
#('A-DEC' means year-end, December; pandas 2.2+ spells this alias 'YE-DEC')
timeseries_df = pd.Series(df['sales'].values,
                     index = pd.date_range('2010-12-31',
                                           periods = len(df),
                                           freq = 'A-DEC'))

timeseries_df

Now, I can simply plot the data. First, I calculate the cumulative sales over time and then I plot the annual sales data and the cumulative data onto the same chart.

cumulative_sales = np.cumsum(timeseries_df)
plt.figure(figsize=(12,8))
plt.plot(timeseries_df)
plt.plot(cumulative_sales)
plt.title('Sales over time')
plt.xlabel('Year')
plt.ylabel('Sales $')
plt.legend(['Annual Sales', 'Cumulative total'])
plt.show()

Now let’s smooth our data. Smoothing, as the name suggests, evens out short-term fluctuations so the underlying trend is easier to see. We have two options: a simple moving average or an exponentially weighted moving average.

A simple moving average takes the average value of the observations within a given time window. For example, with a window size of 10, each smoothed value is the average of 10 consecutive observations (by default, Pandas uses the current observation and the 9 before it). So if you had 10, 12, 11, 190, 13, 16, 21…, the obvious outlier of 190 would be averaged with its neighbours and heavily damped.

Below, we use the Pandas rolling function to achieve this smoothing.

def plot_rolling(timeseries, window):
    #rolling statistics over a trailing window of the given size
    rol_mean = timeseries.rolling(window).mean()
    rol_std = timeseries.rolling(window).std()

    #plot the original series against its rolling mean and standard deviation
    plt.figure(figsize = (12, 8))
    plt.plot(timeseries, color = "blue", label = "Original")
    plt.plot(rol_mean, color = "red", label = "Rolling Mean")
    plt.plot(rol_std, color = "black", label = "Rolling Std")
    plt.legend(loc = "best")
    plt.title("Rolling Mean and Standard Deviation (window = "+str(window)+")")
    plt.show()

plot_rolling(timeseries_df, 10)

#the first 9 values are NaN because the window of 10 is not yet full
timeseries_df.rolling(10).mean()
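
To see what the rolling mean does to the small example above, here is a minimal sketch on those seven values. It uses a window of 3 rather than 10 so that the short series has some full windows. The variable names and the window size are my own choices for illustration. It also shows center=True, which averages the observations on both sides of each point instead of only the preceding ones.

```python
import pandas as pd

# The toy values from the example above (illustrative, not the sales data)
toy = pd.Series([10, 12, 11, 190, 13, 16, 21])

# Trailing window: each value averages the current point and the 2 before it
trailing = toy.rolling(3).mean()

# Centered window: each value averages the point and one neighbour either side
centered = toy.rolling(3, center=True).mean()

print(pd.DataFrame({"original": toy, "trailing": trailing, "centered": centered}))
```

The 190 spike is pulled down to at most 73, but it also lifts the three averages whose window includes it, so a moving average spreads an outlier out rather than removing it.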

The second option is an exponentially weighted moving average (EWMA), whereby the latest observations are given more weight. To decide that weighting we need to define Alpha, a number greater than 0 and at most 1. When Alpha is equal to 1, only the latest observation counts, so the output is simply the original series. As Alpha approaches 0, older observations keep almost as much weight as recent ones, which gives a very smooth series, much like setting a very large window in your moving average. So, an Alpha of 0.9 will result in a trend that is very close to the original, while 0.5 will be noticeably smoother.

def plot_ewma(timeseries, alpha):
    expw_ma = timeseries.ewm(alpha=alpha).mean()

    fig = plt.figure(figsize = (12, 8))
    og_line = plt.plot(timeseries, color = "blue", label = "Original")
    exwm_line = plt.plot(expw_ma, color = "red", label = "EWMA")
    plt.legend(loc = "best")
    plt.title("EWMA (alpha= "+str(alpha)+")")
    plt.show()

plot_ewma(timeseries_df, 0.9)
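
To make the role of Alpha concrete, the sketch below writes out the recursive form of the EWMA by hand, smoothed[t] = alpha * x[t] + (1 - alpha) * smoothed[t-1], and compares it with Pandas. One assumption to flag: this recursion matches ewm(..., adjust=False), whereas plot_ewma above uses the Pandas default adjust=True, which weights the earliest observations slightly differently but responds to Alpha in the same way. The helper manual_ewma and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

def manual_ewma(values, alpha):
    """Recursive EWMA: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    smoothed = [float(values[0])]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

# Toy values with one obvious outlier (illustrative, not the sales data)
toy = pd.Series([10, 12, 11, 190, 13, 16, 21])

for alpha in (0.9, 0.5):
    by_hand = manual_ewma(toy.to_numpy(), alpha)
    with_pandas = toy.ewm(alpha=alpha, adjust=False).mean().to_numpy()
    # The two should agree up to floating-point rounding
    print(alpha, np.allclose(by_hand, with_pandas), by_hand.round(2))
```

At the outlier, the smoothed value is about 172 with an Alpha of 0.9 but 100.5 with an Alpha of 0.5, which shows how a lower Alpha damps the spike more.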
Kodey