There are various time-series data available, we will mainly use the datasets from:
R
’s fma package.Main packages required:
#Import the required modules for vectors and matrix operations, data generation
import numpy as np
#Import the required modules for plot creation:
import matplotlib.pyplot as plt
#import the required modules for TimeSeries data generation:
import statsmodels.api as sm
#Import the required modules for test statistic calculation:
import statsmodels.stats as sm_stat
#Import the required modules for model estimation:
import statsmodels.tsa as smt
# Import pandas dataset
import pandas as pd
The datasets that we will be analysing are:
airpass = sm.datasets.get_rdataset("AirPassengers", "datasets")
Look at its documentation
print(airpass.__doc__)
and extract the values
airpass = pd.Series(airpass.data["value"])
airpass.index = pd.date_range(start = "1949-01", periods = len(airpass.index), freq = "M").to_period()
airpass.index = airpass.index.to_timestamp()
airpass.head()
loti = sm.datasets.get_rdataset("loti", "gamclass")
Note that you will need to fix the format of the data, input the following code and examine the output:
loti.data.head()
#
loti = pd.melt(loti.data.iloc[: , list(range(0,12)) + [len(loti.data.columns)-1]], id_vars = ['Year'])
#
loti.head()
#
cats = ['Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
loti.variable = pd.Categorical(loti.variable, categories = cats)
loti = loti.sort_values(by = ['Year', 'variable'])
loti.head()
#
loti = pd.Series(loti["value"])
loti.index = pd.date_range(start = "1880-01", periods = len(loti.index), freq = "M").to_period()
loti.index = loti.index.to_timestamp()
loti.head()
austa = sm.datasets.get_rdataset("austa", "fpp2")
#
austa = pd.Series(austa.data["value"])
austa.index = pd.date_range(start = "1990-01", periods = len(austa.index), freq = "Y").to_period()
austa.index = austa.index.to_timestamp()
austourists = sm.datasets.get_rdataset("austourists", "fpp2")
#
austourists = pd.Series(austourists.data["value"])
austourists.index = pd.date_range(start = "1999-01", periods = len(austourists.index), freq = "Q").to_period()
austourists.index = austourists.index.to_timestamp()
beer = pd.Series([164, 148, 152, 144, 155, 125, 153, 146, 138, 190,
192, 192, 147, 133, 163, 150, 129, 131, 145, 137, 138, 168, 176,
188, 139, 143, 150, 154, 137, 129, 128, 140, 143, 151, 177, 184,
151, 134, 164, 126, 131, 125, 127, 143, 143, 160, 190, 182, 138,
136, 152, 127, 151, 130, 119, 153])
beer.index = pd.date_range(start = "1991-01", periods = len(beer.index), freq = "M").to_period()
beer.index = beer.index.to_timestamp()
shampoo = pd.Series([266, 145.9, 183.1, 119.3, 180.3, 168.5, 231.8,
224.5, 192.8, 122.9, 336.5, 185.9, 194.3, 149.5, 210.1, 273.3,
191.4, 287, 226, 303.6, 289.9, 421.6, 264.5, 342.3, 339.7, 440.4,
315.9, 439.3, 401.3, 437.4, 575.5, 407.6, 682, 475.3, 581.3, 646.9])
shampoo.index = pd.date_range(start = "1999-01", periods = len(shampoo.index), freq = "Q").to_period()
shampoo.index = shampoo.index.to_timestamp()
hsales = pd.Series([55, 60, 68, 63, 65, 61, 54, 52, 46, 42, 37, 30, 37,
44, 55, 53, 58, 50, 48, 45, 41, 34, 30, 24, 29, 34, 44, 54, 57,
51, 51, 53, 46, 46, 46, 39, 41, 53, 55, 62, 55, 56, 57, 59, 58,
55, 49, 47, 57, 68, 84, 81, 78, 74, 64, 74, 71, 63, 55, 51, 57,
63, 75, 85, 80, 77, 68, 72, 68, 70, 53, 50, 53, 58, 73, 72, 68,
63, 64, 68, 60, 54, 41, 35, 43, 44, 44, 36, 44, 50, 55, 61, 50,
46, 39, 33, 37, 40, 49, 44, 45, 38, 36, 34, 28, 29, 27, 29, 28,
29, 36, 32, 36, 34, 31, 36, 39, 40, 39, 33, 44, 46, 57, 59, 64,
59, 51, 50, 48, 51, 45, 48, 52, 58, 63, 61, 59, 58, 52, 48, 53,
55, 42, 38, 48, 55, 67, 60, 65, 65, 63, 61, 54, 52, 51, 47, 55,
59, 89, 84, 75, 66, 57, 52, 60, 54, 48, 49, 53, 59, 73, 72, 62,
58, 55, 56, 52, 52, 43, 37, 43, 55, 68, 68, 64, 65, 57, 59, 54,
57, 43, 42, 52, 51, 58, 60, 61, 58, 62, 61, 49, 51, 47, 40, 45,
50, 58, 52, 50, 50, 46, 46, 38, 37, 34, 29, 30, 40, 46, 46, 47,
47, 43, 46, 37, 41, 39, 36, 48, 55, 56, 53, 52, 53, 52, 56, 51,
48, 42, 42, 44, 50, 60, 66, 58, 59, 55, 57, 57, 56, 53, 51, 45,
58, 74, 65, 65, 55, 52, 59, 54, 57, 45, 40, 47, 47, 60, 58, 63,
64, 64, 63, 55, 54, 44])
hsales.index = pd.date_range(start = "1973-01", periods = len(hsales.index), freq = "M").to_period()
hsales.index = hsales.index.to_timestamp()
ustreas = pd.Series([91.57, 91.56, 91.40, 91.39, 91.20, 90.90, 90.99,
91.17, 90.98, 91.11, 91.20, 90.92, 90.73, 90.79, 90.86, 90.83,
90.80, 90.21, 90.10, 90.10, 89.66, 89.81, 89.40, 89.34, 89.16,
88.71, 89.28, 89.36, 89.64, 89.53, 89.32, 89.14, 89.41, 89.53,
89.16, 88.98, 88.50, 88.63, 88.62, 88.76, 89.07, 88.47, 88.41,
88.57, 88.23, 87.93, 87.88, 88.18, 88.04, 88.18, 88.78, 89.29,
89.14, 89.14, 89.42, 89.26, 89.37, 89.51, 89.66, 89.39, 89.02,
89.05, 88.97, 88.57, 88.44, 88.52, 87.92, 87.71, 87.52, 87.37,
87.27, 87.07, 86.83, 86.88, 87.48, 87.30, 87.87, 87.85, 87.83,
87.40, 86.89, 87.38, 87.48, 87.21, 87.02, 87.10, 87.02, 87.07,
86.74, 86.35, 86.32, 86.77, 86.77, 86.49, 86.02, 85.52, 85.02,
84.42, 85.29, 85.32])
elec = pd.Series([1254, 1290, 1379, 1346, 1535, 1555, 1655, 1651, 1500,
1538, 1486, 1394, 1409, 1387, 1543, 1502, 1693, 1616, 1841,
1787, 1631, 1649, 1586, 1500, 1497, 1463, 1648, 1595, 1777,
1824, 1994, 1835, 1787, 1699, 1633, 1645, 1597, 1577, 1709,
1756, 1936, 2052, 2105, 2016, 1914, 1925, 1824, 1765, 1721,
1752, 1914, 1857, 2159, 2195, 2287, 2276, 2096, 2055, 2004,
1924, 1851, 1839, 2019, 1937, 2270, 2251, 2382, 2364, 2129,
2110, 2072, 1980, 1995, 1932, 2171, 2162, 2489, 2424, 2641,
2630, 2324, 2412, 2284, 2186, 2184, 2144, 2379, 2383, 2717,
2774, 3051, 2891, 2613, 2600, 2493, 2410, 2390, 2463, 2616,
2734, 2970, 3125, 3342, 3207, 2964, 2919, 2764, 2732, 2622,
2698, 2950, 2895, 3200, 3408, 3679, 3473, 3154, 3107, 3052,
2918, 2786, 2739, 3125, 3033, 3486, 3661, 3927, 3851, 3456,
3390, 3280, 3166, 3080, 3069, 3340, 3310, 3798, 3883, 4191,
4213, 3766, 3628, 3520, 3322, 3250, 3287, 3552, 3440, 4153,
4265, 4655, 4492, 4051, 3967, 3807, 3639, 3647, 3560, 3929,
3858, 4485, 4697, 4977, 4675, 4596, 4491, 4127, 4144, 4014,
3994, 4320, 4400, 5002, 5091, 5471, 5193, 4997, 4737, 4546,
4498, 4350, 4206, 4743, 4582, 5191, 5457, 5891, 5618, 5158,
5030, 4800, 4654, 4453, 4440, 4945, 4788, 5425, 5706, 6061,
5846, 5242, 5408, 5114, 5042, 5008, 4657, 5359, 5193, 5891,
5980, 6390, 6366, 5756, 5640, 5429, 5398, 5413, 5141, 5695,
5554, 6369, 6592, 7107, 6917, 6353, 6205, 5830, 5646, 5379,
5489, 5824, 5907, 6482, 6795, 7028, 6776, 6274, 6362, 5940,
5958, 5769, 5887, 6367, 6165, 6868, 7201, 7601, 7581, 7090,
6841, 6408, 6435, 6176, 6138, 6717, 6470, 7312, 7763, 8171,
7788, 7311, 6679, 6704, 6724, 6552, 6427, 7105, 6869, 7683,
8082, 8555, 8386, 7553, 7398, 7112, 6886, 7077, 6820, 7426,
7143, 8261, 8240, 8977, 8991, 8026, 7911, 7510, 7381, 7366,
7414, 7824, 7524, 8279, 8707, 9486, 8973, 8231, 8206, 7927,
7999, 7834, 7521, 8284, 7999, 8940, 9381, 10078, 9796, 8471,
8572, 8150, 8168, 8166, 7903, 8606, 8071, 9178, 9873, 10476,
9296, 8818, 8697, 8381, 8293, 7942, 8001, 8744, 8397, 9115,
9773, 10358, 9849, 9083, 9143, 8800, 8741, 8492, 8795, 9354,
8796, 10072, 10174, 11326, 10744, 9806, 9740, 9373, 9244, 9407,
8827, 9880, 9364, 10580, 10899, 11687, 11280, 10208, 10212,
9725, 9721, 9846, 9407, 10265, 9970, 10801, 11246, 12167,
11578, 10645, 10613, 10104, 10348, 10263, 9973, 10803, 10409,
11458, 11845, 12559, 12070, 11221, 11338, 10761, 11012, 10923,
10790, 11427, 10788, 11772, 12104, 12634, 12772, 11764, 11956,
11646, 11750, 11485, 11198, 12265, 11704, 12419, 13259, 13945,
13839, 12387, 12546, 12038, 11977, 12336, 11793, 12877, 11923,
13306, 13988, 14002, 14336, 12867, 12721, 12449, 12686, 12810,
12015, 12888, 12431, 13499, 13014, 14296, 14125, 12817, 12862,
12449, 12489, 12621, 12380, 13023, 12302, 13339, 13825, 14428,
14151, 13355, 13094, 12656, 12435, 13287, 12434, 13209, 12817,
13746, 14259, 14590, 14354, 13254, 13464, 13302, 13456, 13171,
12517, 13489, 12509, 13785, 13921, 14603, 14749, 13540, 13457,
13243, 13590, 13487, 12776, 13812, 13032, 14268, 14473, 15359,
14457])
elec.index = pd.date_range(start = "1956-01", periods = len(elec.index), freq = "M").to_period()
elec.index = elec.index.to_timestamp()
Select a few datasets which appear interesting to you - either visually or in general and complete the following tasks.
1.1 Plot your time series. Does the data look stationary? Why?
1.2 Does the data exhibit autocorrelation? What could be the cause of this type of autocorrelation? (Hint:look at the ACF and PACF - think about what happens to them if a deterministic component is introduced).
1.3 As you know, there are two common decomposition models - additive and multiplicative. Which one would be appropriate for this data and why? (Hint: look at the data variation). If needed, transform your data.
After having an initial look at the data and carrying out any needed data transformations, move on to decompose the deterministic components of your time series. Remember that one requirement for the seasonal component with period \(d\) is that \(S_t = S_{t+d}\), i.e. the seasonal component itself must not increase pastime increases.
2.1 Decompose the data via a moving-average method. Plot the trend
and trend + seasonal
components alongside the actual data. Is the decomposed trend and seasonality captured correctly? Explain your answers. (Hint: look at the residual plots as well as their ACF and PACF)
Having an initial idea on the decomposition form and the deterministic components, use a couple of methods for TS decomposition, that allow forecasting. If needed, generate the trend
and seasonal
variables from your data.
3.1 Use the Global Method of OLS to estimate the trend and seasonal components of your data and plot the estimated values from your model. Does the model capture your trend and seasonality components well? Explain why.
3.2 Use the Local Method of Exponential Smoothing to estimate the deterministic components. Does the model capture the trend and seasonality components well? Explain why.
3.3 Compare the different decomposition methods - which one is better for your data? Explain your answer.
3.4 Select the best method and forecast your data 12
steps ahead.
4.1 If your data has a seasonality component - remove the trend (if a trend exists in your series) and estimate an \(SARIMA(p, 0, q)\times (P, 0, Q)_d\) model, where \(d\) is the frequency of your data (as long as its frequency \(d > 1\)). If there is no seasonality - examine the residuals and check whether an \(ARMA\) model is necessary.
4.2 Forecast the \(SARIMA\) (or the \(ARMA\), if there is no seasonality) process \(h = 20\) steps ahead. If necessary, combine with the trend forecast. Plot the fitted and the forecast values.
4.3 Select any other smoothing method (Kernel smoothing, LOWESS or Cubic splines) and estimate the trend in your data. Is the trend increasing, decreasing, or constant in your series?