Please note that you should have completed the Basic Python Tutorial beforehand.
The first subsection of this document repeats the same time series analysis steps as the examples in R. The code below uses the econometric/statistical methods implemented in statsmodels.
The second chapter deals with GARCH model specification and estimation.
The required packages are provided at the start of the document.
#Import the required modules for data generation
import numpy as np
#Import the required modules for plot creation:
import matplotlib.pyplot as plt
#Import the required modules for DataFrame creation:
import pandas as pd
#import the required modules for TimeSeries data generation:
import statsmodels.api as sm
#Import the required modules for test statistic calculation:
import statsmodels.stats as sm_stat
#Import the required modules for model estimation:
import statsmodels.tsa as smt
We can generate the stationary process:
#Set the seed:
np.random.seed(1)
nsample = 200
x1 = np.random.normal(loc = 0.0, scale = 1.0, size = nsample)
#loc == mean; scale == standard deviation
plt.plot(x1)
plt.title("WN, mean = 0")
plt.axhline(y = 0, color='r', linestyle='-')
plt.show()
By drawing the red horizontal line we can visually inspect whether the time series fluctuates around the mean.
We will generate three examples of non-stationary time series:
np.random.seed(1)
#The non-stationary examples reuse nsample = 200 defined above
ns1 = np.random.normal(loc = 0.0, scale = 1.0, size = nsample) + np.arange(start = 1, stop = nsample + 1, step = 1)
ns2 = np.random.normal(loc = 0.0, scale = 1.0, size = nsample) * np.arange(start = 1, stop = nsample + 1, step = 1)
ns3 = np.cumsum(np.random.choice([-1, 1], size = nsample, replace = True))
#Note that for np.arange(), if we specify start = 1, stop = n, then only (n - 1) elements will be created
fig_size = [12, 5]
plt.rcParams["figure.figsize"] = fig_size
plt.subplot(2,1,1) # a 2-row, 1-column figure: go to the first subplot
plt.plot(ns1)
plt.title("non stationary in mean")
plt.subplot(2,2,3) # a 2-row, 2-column figure: go to the third subplot
plt.plot(ns2)
plt.title("non stationary in variance")
plt.subplot(2,2,4) # a 2-row, 2-column figure: go to the fourth subplot
plt.plot(ns3)
plt.title("no clear tendency")
#minimize the overlap of subplots (titles, axis labels etc.):
plt.tight_layout()
plt.show()
These are the three main cases of a non-stationary process: a non-constant mean, a changing variance, and, in the last example, a random walk, whose non-stationarity is not immediately apparent from a small sample size.
We can check how this kind of process looks with different sample sizes (the vertical lines mark where each sample interval ends in the plots):
np.random.seed(1)
#Generate the data
ns31 = np.cumsum(np.random.choice([-1, 1], size = 1000, replace = True))
#Plot different data samples:
fig_size = [12, 5]
plt.rcParams["figure.figsize"] = fig_size
plt.subplot(2,2,1)
plt.plot(ns31[0:99])
plt.axvline(x = 99, color='r', linestyle='-')
plt.subplot(2,2,2)
plt.plot(ns31[0:299])
plt.axvline(x = 99, color='r', linestyle='-')
plt.axvline(x = 299, color='g', linestyle='-')
plt.subplot(2,2,3)
plt.plot(ns31[0:599])
plt.axvline(x = 99, color='r', linestyle='-')
plt.axvline(x = 299, color='g', linestyle='-')
plt.axvline(x = 599, color='orange', linestyle='-')
plt.subplot(2,2,4)
plt.plot(ns31[0:999])
plt.axvline(x = 99, color='r', linestyle='-')
plt.axvline(x = 299, color='g', linestyle='-')
plt.axvline(x = 599, color='orange', linestyle='-')
plt.tight_layout()
plt.show()
We can see that, as we increase the sample size, the variance of the data increases.
Because we know that for this type of process $Var(Y_t) = t$, we can check this using a Monte Carlo simulation:
np.random.seed(1)
#A data.frame equivalent in Python:
df = pd.DataFrame()
for j in range(0, 1000):
    temp_data = pd.DataFrame(np.cumsum(np.random.choice([-1, 1], size = 11, replace = True)))
    temp_data.columns = ["X" + str(j + 1)]
    #Bind the data column-wise:
    df = pd.concat([df.reset_index(drop = True), temp_data], axis = 1)
df.index = ["X" + str(number) for number in range(1, 12)]
df.columns = ["Sample_" + str(number) for number in range(1, 1001)]
We can preview the created DataFrame and calculate the variance in each time period:
print("Number of columns: ", len(df.columns))
print("Number of rows: ", len(df.index))
print("Variance:")
print(df.var(axis = 1))
We can see that as t increases and we look at $X_1,X_2,...,X_{11}$, the variance also increases, which is in line with the theoretical calculations.
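As a quick sanity check of the theory: the random walk above is $Y_t = \sum_{j=1}^{t} Z_j$, where the steps $Z_j$ independently take the values $-1$ and $1$ with probability $1/2$ each, so
$$ \mathbb{E}Z_j = 0, \quad Var(Z_j) = \mathbb{E}Z_j^2 = 1 \quad \Rightarrow \quad Var(Y_t) = \sum_{j=1}^{t} Var(Z_j) = t. $$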
We can preview some of the data (note that we select the first five rows [from index 0 to index 4] and the first four columns [from index 0 to index 3]).
df.iloc[0:5, 0:4]
Here are some of the sample series plotted with different colors:
df.iloc[:, [0, 4, 8, 10]].plot(subplots=False, legend=True)
plt.show()
And here is how all of the series look like on the same plot:
df.plot(subplots=False, legend=False, color='k')
plt.show()
We see that the overlapping paths fill out a widening region of the plot. If we increased the number of simulations from 1000 to more (5000, 10000, etc.), the plot would include additional trajectories that were not present in the smaller simulation.
Plot the sample ACF and PACF of the non-stationary time series:
#Convert the data to a pandas DataFrame:
ns1 = pd.DataFrame(ns1, index = range(1, len(ns1) + 1))
ns2 = pd.DataFrame(ns2, index = range(1, len(ns2) + 1))
ns3 = pd.DataFrame(ns3, index = range(1, len(ns3) + 1))
fig = plt.figure(figsize=(16,4))
fig = sm.graphics.tsa.plot_acf(ns1, lags=40, ax = fig.add_subplot(131))
fig.gca().set_ylim([-0.2, 1.1]) #Allows us to change the y axis limit on the plot
fig = sm.graphics.tsa.plot_acf(ns2, lags=40, ax = fig.add_subplot(132))
fig = sm.graphics.tsa.plot_acf(ns3, lags=40, ax = fig.add_subplot(133))
fig = plt.figure(figsize=(16,4))
fig = sm.graphics.tsa.plot_pacf(ns1, lags=40, ax = fig.add_subplot(131))
fig = sm.graphics.tsa.plot_pacf(ns2, lags=40, ax = fig.add_subplot(132))
fig = sm.graphics.tsa.plot_pacf(ns3, lags=40, ax = fig.add_subplot(133))
The confidence intervals for the ACF use a standard deviation calculated with Bartlett's formula; the confidence intervals for the PACF use a standard deviation of $1/\sqrt{n}$, where $n$ is the length of the series.
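For reference, Bartlett's approximation for the standard error of the sample autocorrelation at lag $k$ is
$$ \widehat{se}(\hat{\rho}(k)) \approx \sqrt{\frac{1}{n}\left(1 + 2\sum_{j=1}^{k-1}\hat{\rho}(j)^2\right)}. $$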
As an example, read the quarterly Canadian employment index data.
txt1 = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/caemp.txt"
#Read the data from the url:
caemp = pd.read_csv(txt1)
#Add a quarterly time index:
caemp.index = pd.date_range(start = "1960-01", periods = len(caemp.index), freq = "Q").to_period()
fig = plt.figure(figsize=(14,5))
caemp.plot(ax = fig.add_subplot(211), title = "Canadian Employment Data")
fig = sm.graphics.tsa.plot_acf(caemp, lags = 20, ax = fig.add_subplot(223))
fig = sm.graphics.tsa.plot_pacf(caemp, lags = 20, ax = fig.add_subplot(224))
plt.tight_layout()
We can test the WN hypothesis by carrying out the Ljung-Box test. Our null hypothesis:
$$H_0: \rho(1) = 0, ..., \rho(k) = 0$$
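For reference, the Ljung-Box statistic for $k$ lags is
$$ Q(k) = n(n+2)\sum_{j=1}^{k}\frac{\hat{\rho}(j)^2}{n - j}, $$
which under $H_0$ is asymptotically $\chi^2_k$-distributed.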
Let us test it for ACF up to lag 5, one by one:
sm_stat.diagnostic.acorr_ljungbox(caemp, lags = 5)
The first array is the X-squared statistic. The second array contains the associated p-values.
rez = pd.DataFrame(list(sm_stat.diagnostic.acorr_ljungbox(caemp, lags = 5)), index = ["X-squared", "p-value"], columns = range(1, 6))
rez
In all cases, we reject the null hypothesis that all of the autocorrelations up to lag k are zero.
Note that rejecting the null hypothesis $H_0: \rho(1) = 0, ..., \rho(k) = 0$ means that at least one autocorrelation is non-zero; it does not necessarily mean that all of them are non-zero.
An example of performing the Ljung-Box test on $\epsilon_t \sim \mathcal{N}(0,1)$:
np.random.seed(1)
#Correlation with the first lag:
rez = sm_stat.diagnostic.acorr_ljungbox(np.random.normal(loc = 0.0, scale = 1.0, size = 100) , lags = 1)
print(rez)
#Correlation with the 2nd lag only:
rez = sm_stat.diagnostic.acorr_ljungbox(np.random.normal(loc = 0.0, scale = 1.0, size = 100) , lags = [2])
print(rez)
Since the p-value > 0.05, we do not reject the null hypothesis $H_0: \rho(1) = 0$.
We will generate three MA models with $\epsilon_t \sim \mathcal{N}(0,1)$.
We can generate the processes in two ways: manually, or using ArmaProcess from the statsmodels.tsa.arima_process module.
Generating manually involves writing the code equivalent of the previously provided mathematical formulas:
np.random.seed(100)
n = 5000
epsilon = np.random.normal(loc = 0.0, scale = 1.0, size = n)
MA_m1 = 3 + epsilon + 0.5 * np.append([0], epsilon[:-1])
MA_m2 = epsilon + 1.3 * np.append([0], epsilon[:-1])
MA_m3 = epsilon - 1.3 * np.append([0], epsilon[:-1]) + 0.4 * np.append([0, 0], epsilon[:-2])
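As an aside, the first MA(1) process could also be generated with ArmaProcess. A minimal sketch (the MA coefficients must include the zero-lag 1, and the constant mean of 3 is not part of ArmaProcess, so it is added to the sample afterwards; MA_m1_spec and MA_m1_alt are illustrative names):
#Alternative generation of the first MA(1) process via ArmaProcess:
MA_m1_spec = smt.arima_process.ArmaProcess(ar = [1], ma = [1, 0.5])
MA_m1_alt = 3 + MA_m1_spec.generate_sample(n)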
The mean of each sample is:
print(MA_m1.mean())
print(MA_m2.mean())
print(MA_m3.mean())
which is what we expect (Note: you can check this by calculating $\mathbb{E}Y_t$ for each case).
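For example, for the first model
$$ \mathbb{E}Y_t = 3 + \mathbb{E}\epsilon_t + 0.5\,\mathbb{E}\epsilon_{t-1} = 3, $$
and similarly $\mathbb{E}Y_t = 0$ for the second and third models, which have no constant term.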
The variance of the second process:
MA_m2.var()
is very close to the theoretical value $Var(Y_t) = (1 + \theta^2)\sigma^2 = 1 + (1.3)^2 = 2.69$.
The ACF and PACF of the three different models:
fig = plt.figure(figsize=(14,5))
#ACF plots:
sm.graphics.tsa.plot_acf(MA_m1, lags = 40, ax = fig.add_subplot(231), title = '$ Y_{t} = 3 + \epsilon_t + 0.5 \epsilon_{t-1}$')
plt.ylabel('ACF')
sm.graphics.tsa.plot_acf(MA_m2, lags = 40, ax = fig.add_subplot(232), title = '$ Y_{t} = \epsilon_t + 1.3 \epsilon_{t-1}$')
plt.ylabel('ACF')
sm.graphics.tsa.plot_acf(MA_m3, lags = 40, ax = fig.add_subplot(233), title = '$ Y_{t} = \epsilon_t - 1.3 \epsilon_{t-1} + 0.4 \epsilon_{t-2}$')
plt.ylabel('ACF')
#PACF plots:
sm.graphics.tsa.plot_pacf(MA_m1, lags = 40, ax = fig.add_subplot(234), title = '$ Y_{t} = 3 + \epsilon_t + 0.5 \epsilon_{t-1}$')
plt.ylabel('PACF')
sm.graphics.tsa.plot_pacf(MA_m2, lags = 40, ax = fig.add_subplot(235), title = '$ Y_{t} = \epsilon_t + 1.3 \epsilon_{t-1}$')
plt.ylabel('PACF')
sm.graphics.tsa.plot_pacf(MA_m3, lags = 40, ax = fig.add_subplot(236), title = '$ Y_{t} = \epsilon_t - 1.3 \epsilon_{t-1} + 0.4 \epsilon_{t-2}$')
plt.ylabel('PACF')
plt.tight_layout()
For the MA(1) processes the ACF has a sharp cutoff after lag 1; for the MA(2) process the sharp cutoff is after lag 2. The PACF of an MA(q) process decays slowly to zero.
Note that the second equation does not satisfy the requirement that $|\theta| < 1$. Therefore, it is not invertible, i.e. it cannot be expressed as an infinite-order AR process.
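To see what invertibility means: an invertible MA(1) process $Y_t = \epsilon_t + \theta \epsilon_{t-1}$ with $|\theta| < 1$ can be rewritten as
$$ \epsilon_t = \sum_{j=0}^{\infty} (-\theta)^j Y_{t-j} \quad\Longleftrightarrow\quad Y_t = \theta Y_{t-1} - \theta^2 Y_{t-2} + \theta^3 Y_{t-3} - \dots + \epsilon_t, $$
and this infinite-order AR representation only converges when $|\theta| < 1$.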
We will generate three AR models with $\epsilon_t \sim \mathcal{N}(0,1)$:
We will also give an example of model generation using the ArmaProcess class:
np.random.seed(100)
n = 5000
epsilon = np.random.normal(loc = 0.0, scale = 1.0, size = n)
AR_m1 = np.array(3.0 + epsilon[0], ndmin = 1)
AR_m2 = np.array(epsilon[0], ndmin = 1)
for j in range(1, n):
    AR_m1 = np.append(AR_m1, 3 + 0.5 * AR_m1[j - 1] + epsilon[j])
    AR_m2 = np.append(AR_m2, 1.2 * AR_m2[j - 1] + epsilon[j])
#Specify the model:
AR_m3_spec = smt.arima_process.ArmaProcess(ar = np.array([1, -1.2, 0.5]), ma = [1])
#Generate the data sample:
AR_m3 = AR_m3_spec.generate_sample(n)
For explanations of how the coefficients are specified in ArmaProcess, see the documentation and the ARMA model generation example in the second section.
We also get a RuntimeWarning about an overflowing range of values; such values are automatically marked as Inf. For example, the last number in the second model:
AR_m2[-1]
To make it easier to analyse the second model, we will reduce the sample of the generated data:
AR_m2 = AR_m2[0:100]
plt.plot(AR_m2)
plt.show()
Because the coefficient $\phi = 1.2 > 1$, the process is explosive: the series declines (or, as in the R examples, grows) exponentially, the direction depending on the signs of the initial shocks $\epsilon_1, \epsilon_2,...$
The mean of each sample is:
print(AR_m1.mean())
print(AR_m2.mean())
print(AR_m3.mean())
Remember that for the generalized AR(1) case (with a constant component) the mean is $m = \alpha/(1-\phi)$. For the first model, $\alpha = 3$ and $\phi = 0.5$ $\Rightarrow$ $m = 6$, which is close to the sample mean of the first model.
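The formula follows by taking expectations of $Y_t = \alpha + \phi Y_{t-1} + \epsilon_t$ under stationarity, where $\mathbb{E}Y_t = \mathbb{E}Y_{t-1} = m$:
$$ m = \alpha + \phi m \quad \Rightarrow \quad m = \frac{\alpha}{1 - \phi}. $$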
Note that the second model violates the requirement that $|\phi| < 1$. This kind of AR process is not stationary.
Trying to generate this kind of model using ArmaProcess is still possible:
np.random.seed(100)
#Specify the model:
AR_m2_spec = smt.arima_process.ArmaProcess(ar = np.array([1, -1.2]), ma = [1])
#Check if it is stationary:
print("Stationarity: ", AR_m2_spec.isstationary)
plt.plot(AR_m2_spec.generate_sample(50))
plt.show()
We can also immediately check for stationarity using the isstationary property. Even though the process is not stationary, we can still generate a sample from it using generate_sample.
The ACF and PACF:
fig = plt.figure(figsize=(14,5))
#ACF plots:
sm.graphics.tsa.plot_acf(AR_m1, lags = 20, ax = fig.add_subplot(231), title = '$Y_t = 3 + 0.5 Y_{t-1} + \epsilon_t$')
plt.ylabel('ACF')
sm.graphics.tsa.plot_acf(AR_m2, lags = 20, ax = fig.add_subplot(232), title = '$Y_t = 1.2 Y_{t-1} + \epsilon_t$')
plt.ylabel('ACF')
sm.graphics.tsa.plot_acf(AR_m3, lags = 20, ax = fig.add_subplot(233), title = '$Y_t = 1.2 Y_{t-1} - 0.5 Y_{t-2} + \epsilon_t$')
plt.ylabel('ACF')
#PACF plots:
sm.graphics.tsa.plot_pacf(AR_m1, lags = 20, ax = fig.add_subplot(234), title = '$Y_t = 3 + 0.5 Y_{t-1} + \epsilon_t$')
plt.ylabel('PACF')
sm.graphics.tsa.plot_pacf(AR_m2, lags = 20, ax = fig.add_subplot(235), title = '$Y_t = 1.2 Y_{t-1} + \epsilon_t$')
plt.ylabel('PACF')
sm.graphics.tsa.plot_pacf(AR_m3, lags = 20, ax = fig.add_subplot(236), title = '$Y_t = 1.2 Y_{t-1} - 0.5 Y_{t-2} + \epsilon_t$')
plt.ylabel('PACF')
plt.tight_layout()
For AR(p) models, the ACF decays gradually to zero. For the AR(1) process, the PACF has a sharp cutoff after lag 1; for the AR(2) process, the PACF has a sharp cutoff after lag 2.
Note that even though the second model is not stationary, its PACF still cuts off after the first lag and its ACF decays slowly (because the coefficient $\phi > 1$, the decay is slower than for the first model with $\phi = 0.5$).
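For comparison, the theoretical ACF of a stationary AR(1) process decays geometrically:
$$ \rho(k) = \phi^{k}, \quad k = 0, 1, 2, \dots $$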
We will generate an ARMA model:
arparams = np.array([0.4])
maparams = np.array([0.6])
The conventions of the ArmaProcess class require that we specify a 1 for the zero-lag of the AR and MA parameters and that the AR parameters be negated; see the documentation:
Both the AR and MA components should include the coefficient on the zero-lag. This is typically 1. Further, due to the conventions used in signal processing used in signal.lfilter vs. conventions in statistics for ARMA processes, the AR parameters should have the opposite sign of what you might expect
arparams = np.r_[1, -arparams] # np.r_ is a quick way to concatenate values into an array
maparams = np.r_[1, maparams]
#Specify the model:
ARMA_spec = smt.arima_process.ArmaProcess(ar = arparams, ma = maparams)
#Check if it is stationary:
print("Stationarity: ", ARMA_spec.isstationary)
#Check if it is invertible
print("Invertibility: ", ARMA_spec.isinvertible)
#Generate the data sample:
ARMA_m1 = ARMA_spec.generate_sample(n)
The ACF and PACF:
fig = plt.figure(figsize=(10,5))
#ACF plots:
sm.graphics.tsa.plot_acf(ARMA_m1, lags = 20, ax = fig.add_subplot(211), title = '$Y_t = 0.4 Y_{t-1} + \epsilon_t + 0.6 \epsilon_{t-1}$')
plt.ylabel('ACF')
#PACF plots:
sm.graphics.tsa.plot_pacf(ARMA_m1, lags = 20, ax = fig.add_subplot(212), title = '$Y_t = 0.4 Y_{t-1} + \epsilon_t + 0.6 \epsilon_{t-1}$')
plt.ylabel('PACF')
plt.tight_layout()
Both ACF and PACF decline for the ARMA process.
statsmodels.tsa.arima_model.ARMA
Let's say we do not know the data generating process that generated our previously mentioned AR, MA and ARMA models. If we examine the ACF and PACF graphs and have an idea of what the underlying model might be, we can use the tsa.arima_model.ARMA class to estimate its parameters:
MA_m1_est = smt.arima_model.ARMA(MA_m1, order = (0, 1))
MA_m1_mdl = MA_m1_est.fit(trend = 'c')
MA_m2_est = smt.arima_model.ARMA(MA_m2, order = (0, 1))
MA_m2_mdl = MA_m2_est.fit(trend = 'nc')
Because the third MA process is not invertible, we need to specify transparams = False, otherwise we will get an error. We also need to specify some starting parameters. For this example, we select start_params = [-1, 0.5] - they are somewhat close to the parameters which we used to generate the data:
MA_m3_est = smt.arima_model.ARMA(MA_m3, order = (0, 2))
MA_m3_mdl = MA_m3_est.fit(trend = 'nc', transparams = False, start_params = [-1, 0.5])
AR_m1_est = smt.arima_model.ARMA(AR_m1, order = (1, 0))
AR_m1_mdl = AR_m1_est.fit(trend = 'c')
AR_m2_est = smt.arima_model.ARMA(AR_m2, order = (1, 0))
AR_m2_mdl = AR_m2_est.fit(trend = 'nc')
AR_m3_est = smt.arima_model.ARMA(AR_m3, order = (2, 0))
AR_m3_mdl = AR_m3_est.fit(trend = 'nc')
ARMA_m1_est = smt.arima_model.ARMA(ARMA_m1, order = (1, 1))
ARMA_m1_mdl = ARMA_m1_est.fit(trend = 'nc')
The summary statistics of our models:
print(MA_m1_mdl.summary())
We can extract specific information from this class:
print("Included trend (this is actually the constant): ", MA_m1_mdl.k_trend)
print("Included constat: ", MA_m1_mdl.k_constant)
print("Number of AR parameters: ", MA_m1_mdl.k_ar)
print("Number of MA parameters" , MA_m1_mdl.k_ma)
print("Included exogenous variables: ", MA_m1_mdl.k_exog)
print("All of the parameters: ", MA_m1_mdl.params)
print("All of the p-values: ", MA_m1_mdl.pvalues)
print("Only AR parameters: ", MA_m1_mdl.arparams)
print("Only MA parameters: ", MA_m1_mdl.maparams)
If we only try to extract the coefficients:
print(ARMA_m1_mdl.params)
We do not get the names of the coefficients (at least this is the case with the current working environment).
To keep only the relevant information, we take the coefficients and their p-values, and define the following function to attach the coefficient names:
def get_coef(my_fitted_model):
    #Get the parameters and their p-values:
    rez = np.array([my_fitted_model.params, my_fitted_model.pvalues]).T
    #Build the parameter names:
    my_index = []
    #Check for specific parameters:
    if my_fitted_model.k_trend > 0:
        my_index = np.append(my_index, "Const")
    if my_fitted_model.k_ar > 0:
        my_index = np.append(my_index, ["AR_L" + str(number) for number in range(1, my_fitted_model.k_ar + 1)])
    if my_fitted_model.k_ma > 0:
        my_index = np.append(my_index, ["MA_L" + str(number) for number in range(1, my_fitted_model.k_ma + 1)])
    #Return a pandas DataFrame:
    df = pd.DataFrame(rez, index = my_index, columns = ["coef", "p_value"])
    return df
Now we can use this method to extract only the relevant model information:
get_coef(MA_m1_mdl)
get_coef(MA_m2_mdl)
get_coef(MA_m3_mdl)
We can see that our estimated coefficients are close to the actual values, except for the second process, which is not invertible.
For the AR models:
get_coef(AR_m1_mdl)
get_coef(AR_m2_mdl)
get_coef(AR_m3_mdl)
Note that the second process is not stationary. As such the estimated value of $\phi$ is close to 1.
When estimating AR, MA or ARMA processes we assume that they are stationary (if they are not, then we need to create such models or transform the data such that it would be stationary).
Note that the Const coefficient of the first $AR(1)$ model is close to the process mean m = 6.
get_coef(ARMA_m1_mdl)
If we are not sure what order of AR, MA or ARMA model would best fit our data, we could use an automated process to select the best possible model (in terms of either AIC or BIC), if such a method is available.
The closest equivalent to R's auto.arima available in Python is the statsmodels.tsa.stattools.arma_order_select_ic function.
For example, it works with all three of our MA process data samples, including the non-invertible one:
MA_m1_est_auto = sm.tsa.arma_order_select_ic(MA_m1, ic='aic', trend='c')
print("1st MA process order: ", MA_m1_est_auto.aic_min_order)
MA_m2_est_auto = sm.tsa.arma_order_select_ic(MA_m2, ic='aic', trend='nc')
print("2nd MA process order: ", MA_m2_est_auto.aic_min_order)
MA_m3_est_auto = sm.tsa.arma_order_select_ic(MA_m3, ic='aic', trend='nc')
print("3rd MA process order: ", MA_m3_est_auto.aic_min_order)
Unfortunately, this method is not fully implemented and errors may sometimes arise (especially when the process is not stationary). With our AR process data:
AR_m1_est_auto = sm.tsa.arma_order_select_ic(AR_m1, ic='aic', trend='c')
print("1st AR process order: ", AR_m1_est_auto.aic_min_order)
AR_m2_est_auto = sm.tsa.arma_order_select_ic(AR_m2, ic='aic', trend='nc')
print("2nd AR process order: ", AR_m2_est_auto.aic_min_order)
Because the second AR process is not stationary, we get a long list of errors and the closest stationary model for this dataset.
AR_m3_est_auto = sm.tsa.arma_order_select_ic(AR_m3, ic='aic', trend='nc')
print("3rd AR process order: ", AR_m3_est_auto.aic_min_order)
If we check our third AR process, which we generated with smt.arima_process.ArmaProcess, we can see that it is both stationary and invertible:
print("Invertibility: ", AR_m3_spec.isinvertible)
print("Stationarity: ", AR_m3_spec.isstationary)
So, the automatic model selection is not always correct. In the third AR process case we got a convergence warning when performing the likelihood function optimization.
With our ARMA process data, the model order is found without any warnings or errors:
ARMA_m1_est_auto = sm.tsa.arma_order_select_ic(ARMA_m1, ic='aic', trend='nc')
print("ARMA process order: ", ARMA_m1_est_auto.aic_min_order)
The output is the order of the $ARMA(p, q)$ model. We can then use the previously used smt.arima_model.ARMA class with the specified order. As mentioned, the calculations take longer compared to R's auto.arima.
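For example, a minimal sketch of re-estimating the ARMA model with the automatically selected order (reusing the get_coef function defined above; the variable names are just for illustration):
ARMA_auto_est = smt.arima_model.ARMA(ARMA_m1, order = ARMA_m1_est_auto.aic_min_order)
ARMA_auto_mdl = ARMA_auto_est.fit(trend = 'nc')
get_coef(ARMA_auto_mdl)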
Having estimated the models, we would like to create our time series forecasts.
Let's say we want to forecast 5 periods into the future. Since we have 5000 historical data points, we will also reduce the historical data shown in the plot to, say, the last 20 observations.
Our code would look like:
fig = plt.figure(figsize=(14,8))
MA_m1_mdl.plot_predict(start = MA_m1.size - 20, end = MA_m1.size + 5, ax = fig.add_subplot(331)) #ax = Existing axes to plot with.
plt.title('$MA:\ Y_t = 3 + \epsilon_t + 0.5 \epsilon_{t-1}$')
plt.legend().remove()
MA_m2_mdl.plot_predict(start = MA_m2.size - 20, end = MA_m2.size + 5, ax = fig.add_subplot(332)) #ax = Existing axes to plot with.
plt.title('$MA:\ Y_t = \epsilon_t + 1.3 \epsilon_{t-1}$')
plt.legend().remove()
MA_m3_mdl.plot_predict(start = MA_m3.size - 20, end = MA_m3.size + 5, ax = fig.add_subplot(333)) #ax = Existing axes to plot with.
plt.title('$MA:\ Y_t = \epsilon_t - 1.3 \epsilon_{t-1} + 0.4 \epsilon_{t-2}$')
plt.legend().remove()
AR_m1_mdl.plot_predict(start = AR_m1.size - 20, end = AR_m1.size + 5, ax = fig.add_subplot(334)) #ax = Existing axes to plot with.
plt.title('$AR:\ Y_t = 3 + 0.5 Y_{t-1} + \epsilon_t$')
plt.legend().remove()
AR_m2_mdl.plot_predict(start = AR_m2.size - 20, end = AR_m2.size + 5, ax = fig.add_subplot(335)) #ax = Existing axes to plot with.
plt.title('$AR:\ Y_t = 1.2 Y_{t-1} + \epsilon_t$')
plt.legend().remove()
AR_m3_mdl.plot_predict(start = AR_m3.size - 20, end = AR_m3.size + 5, ax = fig.add_subplot(336)) #ax = Existing axes to plot with.
plt.title('$AR:\ Y_t = 1.2 Y_{t-1} - 0.5 Y_{t-2} + \epsilon_t$')
plt.legend().remove()
ARMA_m1_mdl.plot_predict(start = ARMA_m1.size - 20, end = ARMA_m1.size + 5, ax = fig.add_subplot(313)) #ax = Existing axes to plot with.
plt.title('$ARMA:\ Y_t = 0.4 Y_{t-1} + \epsilon_t + 0.6 \epsilon_{t-1}$')
plt.legend(loc = 'upper center')
plt.tight_layout()
Because the 2nd AR process is not stationary, its forecasts differ quite a bit from the historical data.
Otherwise, the AR and MA model forecasts behave as we would expect: MA forecasts revert to the process mean after q periods, whereas AR forecasts converge to the mean gradually, approaching but never exactly reaching it.
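If we only need the numeric forecasts rather than a plot, a minimal sketch using the forecast method of the fitted ARMA results, which returns the point forecasts, their standard errors and the confidence intervals:
fc, fc_se, fc_ci = ARMA_m1_mdl.forecast(steps = 5)
print("Point forecasts: ", fc)
print("95% confidence intervals:\n", fc_ci)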
After estimating our models, it is also important to check whether the residuals are WN. It would be useful to have an equivalent of R's tsdisplay, so let's define a function:
def tsdisplay(y, figsize = (14,8), title = ""):
    tmp_data = pd.Series(y)
    fig = plt.figure(figsize = figsize)
    #Plot the time series:
    tmp_data.plot(ax = fig.add_subplot(311), title = "$Time\ Series\ " + title + "$", legend = False)
    #Plot the ACF:
    sm.graphics.tsa.plot_acf(tmp_data, lags = 20, zero = False, ax = fig.add_subplot(323))
    #Plot the PACF:
    sm.graphics.tsa.plot_pacf(tmp_data, lags = 20, zero = False, ax = fig.add_subplot(324))
    #Plot the QQ plot of the data:
    sm.qqplot(tmp_data, line = 's', ax = fig.add_subplot(325))
    plt.title("QQ Plot")
    #Plot the histogram:
    fig.add_subplot(326).hist(tmp_data, bins = 50, normed = 1)
    plt.title("Histogram")
    #Fix the layout of the plots:
    plt.tight_layout()
Firstly, let us look at our $ARMA(1,1)$ model:
tsdisplay(ARMA_m1_mdl.resid, title = ":\ Residuals\ Of\ Y_t = 0.4 Y_{t-1} + \epsilon_t + 0.6 \epsilon_{t-1}")
The residuals appear to not be autocorrelated. The QQ plot suggests that they are standard normal. We can conclude that these residuals are $WN$.
Now, let's look at the MA process which is not invertible:
tsdisplay(MA_m2_mdl.resid, title = ":\ Residuals\ Of\ Y_t = \epsilon_t + 1.3 \epsilon_{t-1}")
As stated in the theory slides, an MA process is always stationary, so the residuals are $WN$ regardless of whether the process is invertible or not.
Finally, let's look at the nonstationary AR process:
tsdisplay(AR_m2_mdl.resid, title = ":\ Residuals\ Of\ Y_t = 1.2 Y_{t-1} + \epsilon_t")
Because we are fitting a stationary AR model to a nonstationary time series, the model is misspecified and the residuals are not $WN$.
We can also carry out the Ljung-Box tests.
For the ARMA series:
pd.DataFrame(list(sm_stat.diagnostic.acorr_ljungbox(pd.Series(ARMA_m1_mdl.resid), lags = 10)), index = ["X-squared", "p-value"], columns = range(1, 11))
We can see that we do not reject the null hypothesis of an uncorrelated series.
For the non-invertible MA process:
pd.DataFrame(list(sm_stat.diagnostic.acorr_ljungbox(pd.Series(MA_m2_mdl.resid), lags = 10)), index = ["X-squared", "p-value"], columns = range(1, 11))
Because the p-value > 0.05 for the first 10 lags, we do not reject the null hypothesis of an uncorrelated series.
For the nonstationary AR process:
pd.DataFrame(list(sm_stat.diagnostic.acorr_ljungbox(pd.Series(AR_m2_mdl.resid), lags = 10)), index = ["X-squared", "p-value"], columns = range(1, 11))
Since the p-value < 0.05, we reject the null hypothesis, which means that the residuals are serially correlated (i.e. autocorrelated).
Another example, this time fitting an incorrect model: recall our non-invertible MA(1) model $Y_t = \epsilon_t + 1.3 \epsilon_{t-1}$.
Let us try to fit an AR(1) model to it and check the residuals:
bad_est = smt.arima_model.ARMA(MA_m2, order = (1, 0))
bad_mdl = bad_est.fit(trend = 'nc')
tsdisplay(bad_mdl.resid, title = ":\ Misspecified\ model")
pd.DataFrame(list(sm_stat.diagnostic.acorr_ljungbox(pd.Series(bad_mdl.resid), lags = 10)), index = ["X-squared", "p-value"], columns = range(1, 11))
We can see that while the residuals appear to be standard normal, they are actually autocorrelated. This can be seen both from the ACF/PACF plots and from the Ljung-Box test results, where the p-value < 0.05 for the different lags leads us to reject the null hypothesis of no autocorrelation.
In this section we will present an example of estimating ARCH and GARCH models with Python. We will use the arch_model function from the arch package.
To install the package, follow the documentation: open the Anaconda Navigator, go to 'Environments', click the arrow near base(root) and open a terminal. Then enter the following:
conda install arch -c bashtage
Note: the OS should be 64-bit!
Now we can import the required package:
import arch as arch
#arch.doc()
We can think of $ARCH$ models as $AR$ models applied to the error variance of a time series $r_t$.
$ARCH(1)$ process:
$$ \begin{cases} r_t &= \mu + \epsilon_t \\ \epsilon_t &= \sigma_t w_t \\ \sigma_t^2 &= \omega + \alpha_1 \epsilon^2_{t-1} \end{cases} $$
np.random.seed(123)
#Define the sample size:
n = 2000
# Define the coefficients:
mu = 1.0
omega = 2.0
alpha = 0.5
# Generate the random shocks
w = np.random.normal(size = n)
# Generate the ARCH process:
r_t = np.array(mu + np.sqrt(omega) * w[0], ndmin = 1)
for j in range(1, n):
    r_t = np.append(r_t, mu + np.sqrt(omega + alpha * ((r_t[j-1] - mu)**2)) * w[j])
First, we plot the ACF and PACF of the time series:
tsdisplay(r_t, title = "\ of\ r_t")
The first lags appear to be slightly correlated. Now let's plot the squared time series:
tsdisplay(r_t**2, title = "\ of\ r_t^2")
If we wanted to estimate an ARCH model on this data, we would first have to create a mean model for the returns r_t. Then we should inspect the residuals: if their ACF and PACF plots show no autocorrelation, but the ACF and PACF plots of the squared residuals do, we should create an ARCH model for the residuals. Look at the theory slides for the detailed steps and try to carry them out for this model, as sketched below.
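A minimal sketch of these steps with the arch package, assuming (as the ACF/PACF above suggest) that a constant mean model is sufficient for r_t; the variable names are illustrative:
#Constant mean model with an ARCH(1) volatility specification:
arch_mdl = arch.univariate.ConstantMean(r_t)
arch_mdl.volatility = arch.univariate.ARCH(p = 1)
arch_mdl.distribution = arch.univariate.Normal()
arch_fit_tmp = arch_mdl.fit(disp = 'off')
print(arch_fit_tmp.params)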
Use the following example of a $GARCH$ model and apply the same steps to the $ARCH$ process.
$GARCH(1, 1)$ is an $ARMA$ model applied to the error variance of a time series:
$$ \begin{cases} r_t &= \mu + \epsilon_t \\ \epsilon_t &= \sigma_t w_t \\ \sigma_t^2 &= \omega + \alpha_1 \epsilon^2_{t-1} + \beta_1 \sigma^2_{t-1} \end{cases} $$
np.random.seed(1)
#Define the sample size:
n = 2000
# Define the coefficients:
mu = 1.0
omega = 0.2
alpha = 0.5
beta = 0.3
# Generate the random shocks
w = np.random.normal(size = n)
sigma_sq = np.zeros_like(w)
# Generate the GARCH process:
sigma_sq[0] = omega
r_t = np.array(mu + np.sqrt(sigma_sq[0]) * w[0], ndmin = 1)
for j in range(1, n):
    sigma_sq[j] = omega + alpha * (r_t[j-1] - mu)**2 + beta * sigma_sq[j-1]
    r_t = np.append(r_t, mu + np.sqrt(sigma_sq[j]) * w[j])
tsdisplay(r_t, title = "\ of\ r_t")
tsdisplay(r_t**2, title = "\ of\ r_t^2")
We note that the series itself appears to be WN, but its squares are autocorrelated.
We begin by creating the model for the mean by looping through all possible ARMA(p, q) model combinations with $p_{max} = q_{max} = 4$:
best_aic = np.inf
best_order = None
best_mdl = None
pq_rng = range(5) # [0,1,2,3,4]
for i in pq_rng:
    for j in pq_rng:
        try:
            tmp_mdl = smt.api.ARIMA(r_t, order = (i, 0, j)).fit(method = 'mle', trend = 'c')
            tmp_aic = tmp_mdl.aic
            if tmp_aic < best_aic:
                best_aic = tmp_aic
                best_order = (i, 0, j)
                best_mdl = tmp_mdl
        except: continue
print("The best ARIMA model is the ARIMA(%d, %d, %d)" % best_order)
#mean_model = smt.api.ARIMA(r_t, order = best_order).fit(method = 'mle', trend = 'c')
#tsdisplay(mean_model.resid, title = "\ of\ residuals")
#tsdisplay(mean_model.resid**2, title = "\ of\ residuals-squared")
print("The mean of the series is ", r_t.mean())
As we saw from the ACF and PACF plots of r_t, there is no need to create an ARMA model for the mean, and our ARMA order selection code gives the same result.
While the mean model residuals appear to be $WN$, the squared residuals exhibit autocorrelation, so we need to fit a GARCH model to the series. Note, however, that the mean of the series is not zero, so we need a mean model with a constant:
# First, specify the mean model - possible either 'ConstantMean()' or 'ZeroMean()':
am = arch.univariate.ConstantMean(r_t)
# Second, specify the volatility model:
am.volatility = arch.univariate.GARCH(p = 1, q = 1)
# Finally, specify the residual distribution:
am.distribution = arch.univariate.Normal()
output = am.fit()
print(output.summary())
We can look at the coefficients and see that they are similar to the ones we used to generate the series (the $\beta$ coefficient differs the most):
print(output.params)
print(output.aic)
The evaluated model is:
$$ \begin{cases} r_t &= 1.015481 + \epsilon_t \\ \epsilon_t &= \sigma_t w_t \\ \sigma_t^2 &= 0.180947 + 0.511791 \epsilon^2_{t-1} + 0.192441 \sigma^2_{t-1} \end{cases} $$
If we needed, we could also estimate an $AR(1)-GARCH(1, 1)$ model:
am_tmp = arch.univariate.ARX(r_t, lags = 1)
am_tmp.volatility = arch.univariate.GARCH(p = 1, q = 1)
am_tmp.distribution = arch.univariate.Normal()
tmp_mdl = am_tmp.fit(disp = 'off')
print(tmp_mdl.params)
print(tmp_mdl.aic)
In this case the coefficients are very similar to the actual ones, with an additional $\phi_1 = - 0.006137$ coefficient in the mean equation:
$$ \begin{cases} r_t &= 1.021405 -0.006137 r_{t-1} + \epsilon_t \\ \epsilon_t &= \sigma_t w_t \\ \sigma_t^2 &= 0.181230 + 0.511749 \epsilon^2_{t-1} + 0.191995 \sigma^2_{t-1} \end{cases} $$
We can see that the $AIC$ is slightly smaller for the $AR(1)-GARCH(1,1)$ model, compared to the $GARCH(1, 1)$ model, which would indicate that the second model is slightly better in terms of $AIC$ value.
We can also look at the standardized residuals:
std_resid = tmp_mdl.resid[1:] / tmp_mdl.conditional_volatility[1:]
tsdisplay(std_resid, title = "\ of\ standardized\ residuals")
tsdisplay(std_resid**2, title = "\ of\ standardized\ residuals-squared")
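We could also formally check the squared standardized residuals for remaining ARCH effects with the same Ljung-Box test used earlier; a minimal sketch:
pd.DataFrame(list(sm_stat.diagnostic.acorr_ljungbox(pd.Series(std_resid**2), lags = 10)), index = ["X-squared", "p-value"], columns = range(1, 11))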
We can use the pandas_datareader package to get some sample data from the web using the Yahoo Finance API (see the package page for details).
To install it, open the Anaconda Navigator, go to 'Environments', click the arrow near base(root) and open a terminal. Then enter the following:
conda install -c anaconda pandas-datareader
import pandas_datareader.data as web
After this, we can download some stock data:
start = '2010-01-01'
end = '2018-01-01'
GOOG = web.DataReader('GOOG', 'yahoo', start=start, end=end)
GOOG.head(10)
Then, we can calculate the returns:
returns = np.log(GOOG['Adj Close'] / GOOG['Adj Close'].shift(1)).dropna()
returns.head(10)
#print(GOOG['Adj Close'].head(2))
#print(GOOG['Adj Close'].shift(1).head(3))
#print(GOOG['Adj Close'].shift(-1).head(3))
returns.plot()
plt.show()
We can then use the data for model estimation. Some example commands:
arch_m = arch.arch_model(returns)
arch_fit = arch_m.fit()
print(arch_fit.summary())
arch_fit.plot()
plt.show()
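If needed, we could also produce forecasts from the fitted model; a minimal sketch using the forecast method of the arch results object, which returns (among other things) the forecast conditional variances:
arch_fc = arch_fit.forecast(horizon = 5)
print(arch_fc.variance.tail(1))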