We shall re-do the example from the lecture slides.
We have data collected on a monthly basis over five years (i.e. 60 months) on the following variables:
- market capitalization of Company B (in '000's of dollars);
- news about the oil price.

#Import the required modules for data generation
import numpy as np
#Import the required modules for plot creation:
import matplotlib.pyplot as plt
#Import the required modules for DataFrame creation:
import pandas as pd
#Import the required modules for TimeSeries data generation:
import statsmodels.api as sm
#Import the required modules for test statistic calculation:
import statsmodels.stats as sm_stat
#Import the required modules for model estimation:
import statsmodels.tsa as smt
#Import the required modules for optimization:
import scipy.optimize as optimize
#We also need additional data:
import statsmodels.formula.api as smf
txt1 = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/badnews.xls"
#Read the data from the url:
BADNEWS = pd.read_excel(txt1)
#Add a time index:
BADNEWS.index = pd.date_range(start = "1990-01", periods = len(BADNEWS.index), freq = "M").to_period()
#Example of different date formats:
print(BADNEWS.index[:4])
BADNEWS.index = BADNEWS.index.to_timestamp()
print(BADNEWS.index[:4])
BADNEWS.columns = ["capital", "price"]
BADNEWS.head()
fig = plt.figure()
BADNEWS.plot(subplots=True, layout=(2,1))
plt.legend()
plt.tight_layout()
plt.show()
Since this is time series data, it is likely that previous month's news about oil price (i.e. the price of oil at time $t-1$) will affect the current market capitalization. As such, it is necessary to include lags of $X$ in the regression.
Below we present OLS estimates of the coefficients of a distributed lag model in which market capitalization is allowed to depend on present news about the oil price and news up to $q_{max} = 4$ months ago. That is: $$\text{capit}_{t} = \alpha + \beta_0 \cdot \text{price}_t + \beta_1 \cdot \text{price}_{t-1} + ... + \beta_4 \cdot \text{price}_{t-4} + \epsilon_t,\quad \epsilon_t \sim WN(0, \sigma^2)$$
def lag(x, n):
    #Return the series lagged (shifted) by n periods; the first n values become NaN:
    if n == 0:
        return x
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    return x.shift(n)
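As a quick sanity check (mirroring what we do later for the L(), d() and time() helpers), we can verify that the lag() function shifts a series as expected; the series below is an arbitrary example:
#Check the lag() helper on a small example series:
tmp = pd.Series([1, 2, 3, 4, 5])
print("series: ", tmp.tolist())
print("lag(series, 2): ", lag(tmp, 2).tolist())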
mod_L4_est = smf.ols(formula = 'capital ~ 1 + lag(price, 0) + lag(price, 1) + lag(price, 2) + lag(price, 3) + lag(price, 4)', data = BADNEWS)
mod_L4_fit = mod_L4_est.fit()
print(mod_L4_fit.summary())
Since the p-value corresponding to the explanatory variable $\text{price}_{t-4}$ is greater than 0.05, we cannot reject the null hypothesis that $\beta_4 = 0$ at the 5% level of significance. Accordingly, we drop this variable from the model and re-estimate it with lag length equal to 3, yielding the following results:
mod_L3_est = smf.ols(formula = 'capital ~ 1 + lag(price, 0) + lag(price, 1) + lag(price, 2) + lag(price, 3)', data = BADNEWS)
mod_L3_fit = mod_L3_est.fit()
print(mod_L3_fit.summary())
print(np.round(mod_L3_fit.params, 4))
print(np.round(mod_L3_fit.pvalues, 4))
The p-value for testing $\beta_3 = 0$ is 0.0003 < 0.05. We therefore conclude that the variable $\text{price}_{t-3}$ does indeed belong in the distributed lag model.
Hence, $q = 3$ is the lag length we select for this model.
In a formal report, we would present this table of results, or the equation: $$\text{capit}_{t} = 90402.2210 -125.9000 \cdot \text{price}_t - 443.4918 \cdot \text{price}_{t-1} -417.6089 \cdot \text{price}_{t-2} -179.9043 \cdot \text{price}_{t-3}$$
The interpretation is similar to the one presented in the lecture slides - increasing the oil price by 1 dollar per barrel in a given month $t$ is associated with:
- an immediate change in market capitalization of about -125.9 thousand dollars in month $t$;
- a change of about -443.5 thousand dollars in month $t+1$;
- a change of about -417.6 thousand dollars in month $t+2$;
- a change of about -179.9 thousand dollars in month $t+3$;
- no further effect in later months.
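As a rough numerical complement to this interpretation (using the mod_L3_fit object estimated above), the total cumulative effect of a one-dollar oil price increase is simply the sum of the lag coefficients:
#Sum of the distributed lag coefficients (excluding the intercept) - the total
#effect of a one-dollar oil price increase accumulated over months t, ..., t+3:
total_effect = mod_L3_fit.params.drop("Intercept").sum()
print("Total (cumulative) effect: ", np.round(total_effect, 4))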
In general, consider the following model: $$Y_t = \alpha + \phi Y_{t-1} + \beta_0 X_t + \beta_1 X_{t-1} + \epsilon_t,\quad 0 < \phi < 1$$
Then:
The short-run multiplier can be calculated by taking the partial derivative: $$\dfrac{\partial Y_t}{\partial X_t} = \beta_0$$ which shows that an increase in $X$ of one unit has an immediate impact on $Y$ of $\beta_0$ units.
The long-run multiplier (or equilibrium multiplier) is calculated as follows:
In order to calculate the long-run multiplier, we need to see how a one-unit increase in $X$ at time $t$ affects $Y$ as $t$ increases: $$\dfrac{\partial Y_t}{\partial X_t} = \beta_0,\quad \dfrac{\partial Y_{t+1}}{\partial X_t} = \phi \beta_0 + \beta_1,\quad \dfrac{\partial Y_{t+2}}{\partial X_t} = \phi(\phi \beta_0 + \beta_1),\quad ...,\quad \dfrac{\partial Y_{t+s}}{\partial X_t} = \phi^{s-1}(\phi \beta_0 + \beta_1)$$
This shows that after the first period the effect decreases (in absolute value) if $|\phi| < 1$.
Imposing the so-called stability condition $|\phi|<1$ allows us to determine the long-run effect of a unit change in $X_t$: $$\beta_0 + (\phi \beta_0 + \beta_1) + \phi (\phi \beta_0 + \beta_1) + \phi^2 (\phi \beta_0 + \beta_1) + ... = \beta_0 + (1 + \phi + \phi^2 + ...)(\phi \beta_0 + \beta_1) = \dfrac{\beta_0 + \beta_1}{1 - \phi}$$
i.e. it is the sum of all the partial derivatives of $Y_{t+k}$ as $k \rightarrow \infty$.
This says that if $X_t$ increases by one unit for one moment, then the expected cumulative change (increase or decrease) in $Y_t$ is given by $(\beta_0 + \beta_1)/(1 - \phi)$.
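A minimal numerical sketch of this result, with hypothetical parameter values (phi = 0.5, beta_0 = 1, beta_1 = 0.5, not estimated from any data), shows the cumulative effect converging to $(\beta_0 + \beta_1)/(1 - \phi)$:
#Hypothetical parameter values:
phi_v, b0, b1 = 0.5, 1.0, 0.5
#Impulse responses: dY_t/dX_t = b0 and dY_{t+s}/dX_t = phi^(s-1)*(phi*b0 + b1) for s >= 1:
effects = [b0] + [phi_v**(s - 1) * (phi_v * b0 + b1) for s in range(1, 51)]
print("Cumulative effect after 50 periods: ", np.round(np.sum(effects), 4))
print("Long-run multiplier (b0 + b1)/(1 - phi): ", (b0 + b1) / (1 - phi_v))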
Note that we can calculate the symbolic expressions of the partial derivatives in Python, similarly to how it is done in R. More information can be found here. We present the first couple of partial derivative examples:
from sympy import *
Y_lag_1, X, X_lag_1, X_t_plus_1, X_t_plus_2 = symbols('Y_lag_1 X X_lag_1 X_t_plus_1 X_t_plus_2')
phi, beta_0, beta_1 = symbols('phi beta_0 beta_1')
#Define the Y_t expression at time t:
expr_t0 = phi * Y_lag_1 + beta_0 * X + beta_1 * X_lag_1
#Define Y_{t+1} and Y_{t+2} by recursively substituting the previous expression:
expr_t1 = phi * expr_t0 + beta_0 * X_t_plus_1 + beta_1 * X
expr_t2 = phi * expr_t1 + beta_0 * X_t_plus_2 + beta_1 * X_t_plus_1
Note that if we print each expression, we see that the nested expressions in expr_t1 and expr_t2 are substituted automatically in Python:
print("Y_{t}: ", expr_t0)
print("Y_{t+1}: ", expr_t1)
print("Y_{t+2}: ", expr_t2)
Now we can calculate the partial derivatives:
#Partial derivative of Y_t at time t with respect to X_t:
pd_1 = diff(expr_t0, X)
print(pd_1)
#Partial derivative at time t+1
pd_2 = diff(expr_t1, X)
print(pd_2)
#Partial derivative at time t+2:
pd_3 = diff(expr_t2, X)
print(pd_3)
which is exactly the same as what we obtained earlier by calculating the derivatives manually.
We can also evaluate an expression for specific values (more examples here) with 2 digit precision:
print("1st partial deriv. when beta_0 = 1 is: ", pd_1.evalf(2, subs = {beta_0: 1}))
print("2nd partial deriv. when beta_0 = 1, beta_1 = 0.5 and phi = -0.1 is: ", pd_2.evalf(2, subs = {beta_0: 1.0, beta_1: 0.5, phi: -0.1}))
The following data contains the yearly observations on Puerto Rican employment rate, minimum wage and other variables.
txt1 = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/prmin.data.xls"
#Read the data from the url:
puerto = pd.read_excel(txt1, header = None)
puerto.columns = ["year","avgmin","avgwage","kaitz","avgcov","covt","mfgwage", "prdef",
"prepop","prepopf","prgnp","prunemp","usgnp","tt","post74", "lprunemp",
"lprgnp","lusgnp","lkaitz","lprun_1","lprepop","lprep_1",
"mincov","lmincov","lavgmin"]
#Add a time index:
puerto.index = pd.date_range(start = "1950-01-01", periods = len(puerto.index), freq = "A").to_period()
#Example of different date formats:
print(puerto.index[:4])
puerto.head()
We will estimate the following model: $$\text{lprepop}_t = \beta_0 + \beta_1 \cdot \text{lmincov}_t + \beta_2 \cdot \text{lusgnp}_t + \epsilon_t, \quad \epsilon_t \sim WN(0, \sigma^2)$$ where:
- lprepop - the logarithm of the employment rate (prepop) in Puerto Rico;
- lmincov - the logarithm of the minimum wage coverage measure (mincov);
- lusgnp - the logarithm of the US GNP (usgnp).
fig = plt.figure()
puerto[['lusgnp', 'lmincov', 'lprepop']].plot(subplots=True, layout=(3,1))
plt.legend()
plt.tight_layout()
plt.show()
All the variables have a trend close to linear. Note that an accurate analysis would require testing for a unit root first!
puerto_M1_est = smf.ols(formula = 'lprepop ~ 1 + lmincov + lusgnp', data = puerto)
puerto_M1_fit = puerto_M1_est.fit()
#Extract the estimated coefficient and p-value:
print(pd.DataFrame([puerto_M1_fit.params, puerto_M1_fit.pvalues], index = ["coef", "p-value"]).T)
Thus, if the minimum wage increases, then employment declines, which matches classical economics.
On the other hand, the GNP seems to be insignificant, but we shall test this claim now by including the trend tt:
puerto_M2_est = smf.ols(formula = 'lprepop ~ 1 + lmincov + lusgnp + tt', data = puerto)
puerto_M2_fit = puerto_M2_est.fit()
print(pd.DataFrame([puerto_M2_fit.params, puerto_M2_fit.pvalues], index = ["coef", "p-value"]).T)
Here, we interpret the coefficient of lusgnp as follows: if usgnp rises by 1% above its long-run trend, then prepop will rise an extra 1.057% above its long-run trend.
Note that the above regression is equivalent to the following (the initial equation rewritten in terms of deviations from the trend):
#Get the residuals of the de-trended series:
res_1 = smf.ols(formula = 'lprepop ~ 1 + tt', data = puerto).fit().resid
res_2 = smf.ols(formula = 'lmincov ~ 1 + tt', data = puerto).fit().resid
res_3 = smf.ols(formula = 'lusgnp ~ 1 + tt', data = puerto).fit().resid
res_data = pd.DataFrame([res_1, res_2, res_3], index = ["lprepop_detrend", "lmincov_detrend", "lusgnp_detrend"]).T
print("\n", res_data.head(), "\n\n")
puerto_detrended = smf.ols(formula = 'lprepop_detrend ~ 1 + lmincov_detrend + lusgnp_detrend', data = res_data).fit()
print(pd.DataFrame([puerto_detrended.params, puerto_detrended.pvalues], index = ["coef", "p-value"]).T)
Note that the intercept was already accounted for in the de-trending regressions, so here it is not only insignificant but also very small; let's remove it from this regression:
puerto_detrended = smf.ols(formula = 'lprepop_detrend ~ -1 + lmincov_detrend + lusgnp_detrend', data = res_data).fit()
results = pd.DataFrame([puerto_detrended.params, puerto_detrended.pvalues], index = ["coef", "p-value"]).T
print(np.round(results, 4))
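As a quick check of the equivalence claimed above (a consequence of the Frisch-Waugh-Lovell theorem), we can place the slope estimates from the trend-augmented regression next to those from the de-trended regression (using the puerto_M2_fit and puerto_detrended objects created above):
#Compare the slope coefficients from the regression in levels (with a trend)
#with those from the regression on the de-trended series:
print(pd.DataFrame({"levels + trend": puerto_M2_fit.params[["lmincov", "lusgnp"]].values,
                    "de-trended": puerto_detrended.params.values},
                   index = ["lmincov", "lusgnp"]))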
We can see that accounting for a time trend changes the significance of the exogenous variables in our model.
Let us analyze daily IBM stock prices spanning May 17, 1961 to November 2, 1962 (369 days in all) and daily closing prices of German DAX index starting at the 130th day of 1991.
OPTION 1: LOAD THE DATASET FROM A PACKAGE IN R:
We can install rpy2 - a package to use R inside of Python. More information here. This allows us to load datasets from R in Python. Open the Anaconda Navigator -> Environments, click the arrow on base (root), select Open Terminal and type the following:
conda install -c r rpy2
conda install -c r r-urca
If prompted to proceed, type y to install or update the required packages.
Note that if we have more than one installation of R (some versions of Anaconda have a separate path to their own version of R), then we need to specify the location of the packages for the version of R that we want to use.
Note that the newest versions of the R packages are located in:
C:\Users\[Your PC Username Here]\Documents\R\win-library\
The folder with the highest version number is the package folder of the newest R version. If needed, an older folder can also be specified. In our case, the packages are located in:
from rpy2.robjects.packages import importr
waveslim = importr('waveslim', lib_loc = "C:/Users/Andrius/Documents/R/win-library/3.4")
Next, we need to load the ibm data as we would do in R with data(ibm):
utils = importr("utils")
utils.data("ibm")
Finally, we can transform the R object into an array in Python:
from rpy2.robjects import r, pandas2ri
ibm_data = pandas2ri.ri2py(r["ibm"])
ibm_data = pd.Series(ibm_data)
print(ibm_data.head())
dax_data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/EuStockMarkets.csv")
dax_data = dax_data["DAX"][:369]
dax_data = pd.Series(dax_data, name = "DAX Index Price")
dax_data.head()
OPTION 2: if the data does not load from R, or if you do not have R installed, use the following code to load the data:
ibm_data = [460., 457., 452., 459., 462., 459., 463., 479., 493., 490., 492.,
498., 499., 497., 496., 490., 489., 478., 487., 491., 487., 482.,
479., 478., 479., 477., 479., 475., 479., 476., 476., 478., 479.,
477., 476., 475., 475., 473., 474., 474., 474., 465., 466., 467.,
471., 471., 467., 473., 481., 488., 490., 489., 489., 485., 491.,
492., 494., 499., 498., 500., 497., 494., 495., 500., 504., 513.,
511., 514., 510., 509., 515., 519., 523., 519., 523., 531., 547.,
551., 547., 541., 545., 549., 545., 549., 547., 543., 540., 539.,
532., 517., 527., 540., 542., 538., 541., 541., 547., 553., 559.,
557., 557., 560., 571., 571., 569., 575., 580., 584., 585., 590.,
599., 603., 599., 596., 585., 587., 585., 581., 583., 592., 592.,
596., 596., 595., 598., 598., 595., 595., 592., 588., 582., 576.,
578., 589., 585., 580., 579., 584., 581., 581., 577., 577., 578.,
580., 586., 583., 581., 576., 571., 575., 575., 573., 577., 582.,
584., 579., 572., 577., 571., 560., 549., 556., 557., 563., 564.,
567., 561., 559., 553., 553., 553., 547., 550., 544., 541., 532.,
525., 542., 555., 558., 551., 551., 552., 553., 557., 557., 548.,
547., 545., 545., 539., 539., 535., 537., 535., 536., 537., 543.,
548., 546., 547., 548., 549., 553., 553., 552., 551., 550., 553.,
554., 551., 551., 545., 547., 547., 537., 539., 538., 533., 525.,
513., 510., 521., 521., 521., 523., 516., 511., 518., 517., 520.,
519., 519., 519., 518., 513., 499., 485., 454., 462., 473., 482.,
486., 475., 459., 451., 453., 446., 455., 452., 457., 449., 450.,
435., 415., 398., 399., 361., 383., 393., 385., 360., 364., 365.,
370., 374., 359., 335., 323., 306., 333., 330., 336., 328., 316.,
320., 332., 320., 333., 344., 339., 350., 351., 350., 345., 350.,
359., 375., 379., 376., 382., 370., 365., 367., 372., 373., 363.,
371., 369., 376., 387., 387., 376., 385., 385., 380., 373., 382.,
377., 376., 379., 386., 387., 386., 389., 394., 393., 409., 411.,
409., 408., 393., 391., 388., 396., 387., 383., 388., 382., 384.,
382., 383., 383., 388., 395., 392., 386., 383., 377., 364., 369.,
355., 350., 353., 340., 350., 349., 358., 360., 360., 366., 359.,
356., 355., 367., 357., 361., 355., 348., 343., 330., 340., 339.,
331., 345., 352., 346., 352., 357.]
ibm_data = pd.Series(ibm_data, name = "IBM Stock Price")
stock_data = pd.DataFrame([ibm_data, dax_data], index = ["IBM", "DAX"]).T
plt.figure()
stock_data.plot(subplots=True, layout=(2,1))
plt.legend()
plt.show()
Let's create a static model by regressing the price of IBM stock on the price of the DAX index:
ID_est = smf.ols(formula = 'IBM ~ 1 + DAX', data = stock_data)
ID_fit = ID_est.fit()
print(np.round(pd.DataFrame([ID_fit.params, ID_fit.pvalues], index = ["coef", "p-value"]).T, 4))
Although, because of their nature, IBM and DAX should not have any relationship, the regression coefficient is highly significant. This is an example of a spurious regression, which can be explained by the fact that the errors of the model have a unit root.
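As an illustrative aside (a simulation, not part of the lecture data), regressing one independent random walk on another frequently produces a "significant" coefficient even though the series are unrelated - exactly the spurious regression phenomenon described above:
#Simulate two independent random walks and regress one on the other:
np.random.seed(123)
rw = pd.DataFrame({"y": np.cumsum(np.random.normal(size = 500)),
                   "x": np.cumsum(np.random.normal(size = 500))})
spurious_fit = smf.ols(formula = 'y ~ 1 + x', data = rw).fit()
print(np.round(pd.DataFrame([spurious_fit.params, spurious_fit.pvalues], index = ["coef", "p-value"]).T, 4))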
Note: to have formula functionality similar to that of R's dynlm package, let's define all the functions that we will need:
#Lag operator:
def L(x, n = 1):
    if n == 0:
        return x
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    return x.shift(n)
#Difference operator:
def d(x):
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    return x.diff()
#Trend/time operator:
def time(x):
    return list(range(1, len(x) + 1))
We can test the functions to verify that they work correctly:
test_series = pd.Series(list(range(3, 23)))
print("Series: ", test_series.tolist())
print("lagged series: ", L(test_series, 1).tolist())
print("diff. series: ", d(test_series).tolist())
print("trend: ", time(test_series))
The following should be carried out:
1. check whether IBM has a unit root;
2. check whether DAX has a unit root;
3. check whether the errors of the ID model have a unit root.

Steps (1) and (2) can be carried out in the same manner as in the previous lectures via the Augmented Dickey-Fuller unit root test, using the automated lag order selection procedure according to the AIC value.
Note that the possible values for the regression argument are {'c', 'ct', 'ctt', 'nc'} - the constant and trend order to include in the regression:
- c: constant only (default);
- ct: constant and trend;
- ctt: constant, and linear and quadratic trend;
- nc: no constant, no trend.

IBM_adf = smt.stattools.adfuller(stock_data["IBM"], maxlag = 5, autolag = 'AIC', regression = 'c', regresults = True)
IBM_adf
We see that:
- The 1st element is the test statistic of the ADF test;
- The 2nd element is the p-value associated with the test statistic;
- The 3rd element is a dictionary of the critical values for the test statistic at the 1 %, 5 %, and 10 % levels;
- The 4th element is the additional regression results returned from the test;
#Get the regression results:
ibm_adf_mdl = IBM_adf[3]
We can look at the possible attributes which we can access in the variable:
dir(IBM_adf[3])
We can print the hypothesis of the test:
print("H0: ", ibm_adf_mdl.H0)
print("H1: ", ibm_adf_mdl.HA)
Note the use of the word level, which indicates that the coefficient is for the variable $Y_t$ (the level) and not $\Delta Y_t$:
print("Number of Delta Y_t lags in the ADF regression : ", ibm_adf_mdl.usedlag)
print("ADF test statistic: ", ibm_adf_mdl.adfstat)
print("ADF critical values: ", ibm_adf_mdl.critvalues)
Because the ADF test-statistic is greater than the 5% critical value, we do not reject the null hypothesis that the time series has a unit root, meaning it is non-stationary (i.e. that it is DS).
The same can be said if we look at the p-value:
np.round(IBM_adf[1], 4)
Because the p-value > 0.05, we do not reject the null hypothesis, which means that the IBM time series is a DS process (or, equivalently, an integrated series).
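The decision rule can also be stated programmatically (using the IBM_adf object created above):
#Compare the ADF statistic with the 5% critical value and state the decision:
adf_stat = IBM_adf[0]
crit_5 = IBM_adf[2]['5%']
print("ADF statistic: ", np.round(adf_stat, 4), ", 5% critical value: ", np.round(crit_5, 4))
print("Reject H0 (unit root)? ", adf_stat < crit_5)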
We can also look at the regression carried out via OLS:
print(ibm_adf_mdl.resols.summary())
Note that:
- x1 is the $\rho$ coefficient (its $\text{t-statistic} = -0.343$) used when testing the null hypothesis of a unit root, i.e. $H_0: \rho = 0$, or, equivalently (using the definition of $\rho$), $H_0: \phi_1 + ... + \phi_{p+1} = 1$;
- x2 is the $\gamma_1$ coefficient of $\Delta Y_{t-1}$ (since ibm_adf_mdl.usedlag is equal to 1, only one lagged difference enters the regression).
So, we have confirmed that IBM has a unit root. We can do the same for DAX (here we present only the relevant results):
DAX_adf = smt.stattools.adfuller(stock_data["DAX"], maxlag = 5, autolag = 'AIC', regression = 'c', regresults = True)
print("ADF test statistic", DAX_adf[0])
print("ADF test p-value: ", DAX_adf[1])
print("ADF critical values: ", DAX_adf[2])
Since the p-value is greater than 0.05, we do not reject the null hypothesis of a unit root.
Note that the hypotheses and the selected lag order can be extracted as before:
print("H0: ", DAX_adf[3].H0)
print("H1: ", DAX_adf[3].HA)
print("Lag order : ", DAX_adf[3].usedlag)
Regarding step (3):
ID_res = pd.DataFrame([ID_fit.resid], index = ["ID_res"]).T
ID_res_est = smf.ols(formula = 'd(ID_res) ~ 1 + L(ID_res, 1) + time(ID_res)', data = ID_res)
ID_res_fit = ID_res_est.fit()
print(np.round(pd.DataFrame([ID_res_fit.params, ID_res_fit.pvalues], index = ["coef", "p-value"]).T, 4))
Remember that we have previously defined the d(), L() and time() functions, which lets us use them in the formula expression.
Recall that the unit root hypothesis $H_0$ claims that the coefficient of $\text{L(ID_res, 1)}$ equals zero. In this case of testing for a unit root in the errors we have a further complication: here the t-statistic has not the Dickey-Fuller but the Engle-Granger distribution, whose critical values (by significance level) are given in the following table:
| No. of variables (including $Y$) | 1% | 5% | 10% |
|---|---|---|---|
| 2 | -3.90 | -3.34 | -3.04 |
| 3 | -4.29 | -3.74 | -3.45 |
| 4 | -4.64 | -4.10 | -3.81 |
| 5 | -4.96 | -4.42 | -4.13 |
In order to test for cointegration of these two variables, note that the t-statistic equals $-1.884$, which is closer to $0$ than $-3.34$; therefore there are no grounds to reject $H_0$.
Thus, the errors have a unit root or, in other words, $IBM$ and $DAX$ are not cointegrated and the strong regression bonds are only spurious.
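To make this comparison explicit in code (using the ID_res_fit object from above and the 5% Engle-Granger critical value for two variables, -3.34, from the table):
#Extract the t-statistic of the lagged residual and compare it with the critical value:
t_stat = ID_res_fit.tvalues['L(ID_res, 1)']
print("t-statistic: ", np.round(t_stat, 4))
print("Reject H0 (unit root in errors)? ", t_stat < -3.34)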
It is well known that many economic time series are $DS$ and therefore the regression for levels is often misleading (the standard Student or Wald tests provide wrong p-values). On the other hand, if these $DS$ series are cointegrated, the OLS method is OK.
Recall that if $Y$ and $X$ have unit roots (i.e. are non-stationary), but some linear combination $Y_t - \alpha - \beta X_t$ is (Trend) Stationary, then we say that $Y$ and $X$ are cointegrated.
In order to establish cointegration, we estimate the unknown coefficients $\alpha$ and $\beta$ via OLS and then test whether the errors of the model have a unit root: $$ \begin{aligned} Y_t &= \alpha + \beta X_t\ (+ \delta t) + \epsilon_t \\ &\Downarrow_{OLS}\\ \widehat{e}_t &= Y_t - \widehat{\alpha} - \widehat{\beta} X_t\ (- \widehat{\delta} t ) \end{aligned} $$
The example dataset contains quarterly data for 1947:Q1 - 1989:Q3, where:
- lc - logarithms of the real quarterly personal expenditure (i.e. consumption);
- ly - logarithms of the real quarterly aggregated disposable income.

Note: to use different frequencies see here.
url = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/hamilton.txt"
ham = pd.read_csv(url, sep = " ")
ham.index = pd.date_range(start = ham["quarter"][0], periods = len(ham.index), freq = "Q").to_period()
ham = ham[["lc", "ly"]]
plt.figure()
ham.plot(subplots = False)
plt.legend()
plt.show()
It seems that both time series may have a trend; as such, our cointegration equation will include a time trend as well, i.e.: $$Y_t = \alpha + \beta X_t\ + \delta t + \epsilon_t$$
coint_eq = smf.ols(formula = 'lc ~ 1 + ly + time(ly)', data = ham).fit()
print(np.round(pd.DataFrame([coint_eq.params, coint_eq.pvalues], index = ["coef", "p-value"]).T, 4))
plt.figure(figsize = (10, 5))
plt.scatter(ham["ly"], ham["lc"])
plt.plot(ham["ly"], coint_eq.fittedvalues, color = "red")
plt.xlabel("ly")
plt.ylabel("lc")
plt.show()
plt.figure(figsize = (10, 5))
coint_eq.resid.plot()
plt.hlines(y = 0, xmin = coint_eq.resid.index[0], xmax = coint_eq.resid.index[-1], color = "red")
plt.show()
It is easy to verify, using Procedure C from the previous lecture, that both time series have unit roots (note that we are including the trend component as well):
mod_ly = smt.stattools.adfuller(ham["ly"], maxlag = 5, autolag = 'AIC', regression = 'ct', regresults = True)
print("Lag order selected: ", mod_ly[3].usedlag)
print("ADF test statistic", mod_ly[0])
print("ADF test p-value: ", mod_ly[1])
print("ADF critical values: ", mod_ly[2])
mod_lc = smt.stattools.adfuller(ham["lc"], maxlag = 5, autolag = 'AIC', regression = 'ct', regresults = True)
print("Lag order selected: ", mod_lc[3].usedlag)
print("ADF test statistic", mod_lc[0])
print("ADF test p-value: ", mod_lc[1])
print("ADF critical values: ", mod_lc[2])
We see that in both cases the associated p-values are greater than 0.05, so we do not reject the null hypothesis of a unit root for either of the time series.
We can also do the same thing manually (remember Procedure A from the previous lectures):
#Initial model for unit root test:
adf_lc = smf.ols(formula = 'd(lc) ~ 1 + L(lc, 1) + time(lc) + L(d(lc), 1) + L(d(lc), 2) + L(d(lc), 3) + L(d(lc), 4)', data = ham).fit()
print(np.round(pd.DataFrame([adf_lc.params, adf_lc.pvalues], index = ["coef", "p-value"]).T, 4), "\n\n -------------- \n")
#Now drop the last lag if it is insignificant:
adf_lc = smf.ols(formula = 'd(lc) ~ 1 + L(lc, 1) + time(lc) + L(d(lc), 1) + L(d(lc), 2) + L(d(lc), 3)', data = ham).fit()
print(np.round(pd.DataFrame([adf_lc.params, adf_lc.pvalues], index = ["coef", "p-value"]).T, 4), "\n\n -------------- \n")
#Drop the 3rd lag as well, since it is also insignificant:
adf_lc = smf.ols(formula = 'd(lc) ~ 1 + L(lc, 1) + time(lc) + L(d(lc), 1) + L(d(lc), 2)', data = ham).fit()
print(np.round(pd.DataFrame([adf_lc.params, adf_lc.pvalues], index = ["coef", "p-value"]).T, 4), "\n\n -------------- \n")
We see that the coefficient of the largest remaining lag, $\Delta Y_{t-2}$, is significant, so our selected lag order is 2. The trend coefficient time(lc) is also significant.
Looking at the t-statistic of L(lc, 1):
print(adf_lc.tvalues)
The value is -2.391784 > -3.45, so we do not reject the null hypothesis of a unit root. The same can be shown for the variable ly as well (verify this!).
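As a starting point for that exercise, the analogous initial regression for ly is given below (the reader should then drop insignificant lags exactly as we did for lc):
#Initial unit root test regression for ly:
adf_ly = smf.ols(formula = 'd(ly) ~ 1 + L(ly, 1) + time(ly) + L(d(ly), 1) + L(d(ly), 2) + L(d(ly), 3) + L(d(ly), 4)', data = ham).fit()
print(np.round(pd.DataFrame([adf_ly.params, adf_ly.pvalues], index = ["coef", "p-value"]).T, 4))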
Now, we want to test whether the errors of the model: $$\text{lc}_t = \alpha + \beta \cdot \text{ly}_t + \gamma t + \epsilon_t$$
form a stationary process, i.e. do not have a unit root. We will try two different cointegration models - one with a trend and one without:
cy_res_no_trend = pd.DataFrame([smf.ols(formula = 'lc ~ 1 + ly', data = ham).fit().resid], index = ["res"]).T
cy_res_trend = pd.DataFrame([smf.ols(formula = 'lc ~ 1 + ly + time(lc)', data = ham).fit().resid], index = ["res"]).T
We want to test whether $|\phi| < 1$ from the model: $$\text{cy_res}_t = \phi \cdot \text{cy_res}_{t-1} + \epsilon_t$$
However, if the errors of this model do not constitute WN, the standard OLS procedure gives an erroneous (i.e. incorrect) estimate of $\phi$. Then we can either:
- test for a unit root in the residuals with an augmented regression, but using special (Engle-Granger) critical values - this is done below; or
- use the Phillips-Ouliaris test.

The first approach is also called the Engle-Granger method; it tests the unit-root-in-errors hypothesis. However, the OLS procedure chooses the coefficients of the cointegration equation such that the variance of the residuals is minimal, thus the residuals are "too stationary". In other words, the null hypothesis of a unit root will be rejected too often.
This is why we use other critical values to test the null hypothesis $H_0: \rho = 0$ in $\Delta e_t = \rho e_{t-1} + \gamma_1 \Delta e_{t-1} + ... + \gamma_{p-1} \Delta e_{t-p+1} + w_t$.
The asymptotic 5% critical values for residual unit root tests for cointegration in $$Y_t = \alpha + \beta X_t + \epsilon_t$$ are given in the following table:
| No. of X's in the right-hand side of eq. | No deterministic terms in eq. | Only constant in eq. | Constant and trend in eq. |
|---|---|---|---|
| 1 | -2.76 | -3.37 | -3.80 |
| 2 | -3.27 | -3.77 | -4.16 |
| 3 | -3.74 | -4.11 | -4.49 |
| 4 | -4.12 | -4.45 | -4.74 |
| 5 | -4.40 | -4.71 | -5.03 |
We shall perform the test manually. The final model of the residuals is:
res_coint_no_trend_eq = smf.ols(formula = 'd(res) ~ -1 + L(res, 1) + L(d(res), 1) + L(d(res), 2) + L(d(res), 3) + L(d(res), 4)', data = cy_res_no_trend).fit()
print(np.round(pd.DataFrame([res_coint_no_trend_eq.params, res_coint_no_trend_eq.pvalues], index = ["coef", "p-value"]).T, 4), "\n-----\n")
print(res_coint_no_trend_eq.tvalues)
In this case, since -2.752 is closer to zero than -3.37, we do not reject the unit-root-in-errors null hypothesis, thus the processes lc and ly are not cointegrated.
res_coint_trend_eq =smf.ols(formula = 'd(res) ~ -1 + L(res, 1) + L(d(res), 1)', data = cy_res_trend).fit()
print(np.round(pd.DataFrame([res_coint_trend_eq.params, res_coint_trend_eq.pvalues], index = ["coef", "p-value"]).T, 4), "\n-----\n")
print(res_coint_trend_eq.tvalues)
On the other hand, if we include a trend component, then our t-statistic is -4.5368 < -3.80. In this case, we reject the null hypothesis, i.e. we would conclude that lc and ly are cointegrated!
This highlights one drawback of the Engle-Granger procedure: the test does not consistently indicate cointegration.
One possible explanation is that the Engle-Granger and Dickey-Fuller tests are known to have low power.
The Phillips-Ouliaris cointegration test: $$H_0: \text{x and y are not cointegrated}$$
Note that Python does not have an implementation of this test. This can be somewhat remedied by passing instructions to R from Python:
import rpy2.robjects as ro
#Append our actual library directory to the current one:
ro.r('.libPaths(c(.libPaths(),"C:/Users/Andrius/Documents/R/win-library/3.4"))')
print(ro.r('.libPaths()'))
ro.globalenv['lc'] = ro.FloatVector(ham["lc"])
ro.globalenv['ly'] = ro.FloatVector(ham["ly"])
ro.globalenv['cy_data'] = ro.r('cbind(lc,ly)')
ro.r('library(urca)')
ro.globalenv['cy_po'] = ro.r('ca.po(cy_data, demean = "trend", type = "Pz")')
print(ro.r('summary(cy_po)'))
The test type = "Pz", compared to the test type = "Pu", has the advantage that it is invariant to the normalization of the cointegration vector, i.e. it does not matter which variable is on the left hand side of the equation.
#More compactly - use `capture.output` and print the last 6 lines of the summary:
ro.globalenv['cy_po_summary'] = ro.r('capture.output(summary(cy_po))')
print(ro.r('cy_po_summary[(length(cy_po_summary) - 6):length(cy_po_summary)]'))
The test statistic 49.341 is closer to 0 than 81.3812 (alternatively, 49.341 < 81.3812), therefore we do not reject the null hypothesis and conclude that lc and ly are not cointegrated.
If $X$ and $Y$ are integrated but not cointegrated, their ties (in levels) can only be spurious.
In order to get sensible results, the model may be augmented with new variables, or another $ADL$ model (for stationary differences) may be created, like: $$\Delta Y_t = \phi + \gamma_1 \Delta Y_{t-1} + ... + \gamma_p \Delta Y_{t-p} + \omega_0 \Delta X_t + ... + \omega_q \Delta X_{t-q} + \epsilon_t$$
A similar model can also be analyzed in the case of cointegration, but then one more term, the so-called error correction term $\widehat{u}_{t-1}$ (where $Y_t = \alpha + \beta X_t + u_t$ is the long-run equilibrium, or cointegration, equation), should be included: $$\Delta Y_t = \phi + \delta t + \lambda \widehat{u}_{t-1} + \gamma_1 \Delta Y_{t-1} + ... + \gamma_p \Delta Y_{t-p} + \omega_0 \Delta X_t + ... + \omega_q \Delta X_{t-q} + \epsilon_t$$
This is the Error Correction Model (ECM), where the coefficient $\lambda$ must be negative. E.g., if $\lambda = - 0.25$, then a unit deviation from the equilibrium will be compensated in $1/0.25 = 4$ time units.
The ECM is usually created in two steps: first, estimate the long-run equilibrium (cointegration) equation $Y_t = \alpha + \beta X_t + u_t$ by OLS and save its residuals $\widehat{u}_t$; second, regress $\Delta Y_t$ on $\widehat{u}_{t-1}$ and the relevant lagged differences of $Y$ and $X$.
Let's say we have a dataset where (the variables are constructed from the raw columns in the code below):
- X - total (nominal) expenditure (NTE);
- Q - real food expenditure (RF);
- P - the food price deflator (NF/RF);
- G - the general expenditure deflator (NTE/RTE).

Suppose further that we have found that $\ln Q$ and $\ln(X/G)$ are cointegrated (this should be tested!).
url = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/data1.txt"
DT1 = pd.read_csv(url, sep = "\t")
#Fix an error when reading the file
if pd.isna(DT1.iloc[17, 2]):
    DT1.iloc[17, 2] = DT1.iloc[17, 3]
    DT1.iloc[17, 3] = DT1.iloc[17, 4]
    DT1.iloc[17, 4] = DT1.iloc[17, 5]
    DT1.iloc[17, 5] = DT1.iloc[17, 6]
DT1 = DT1.iloc[:,0:5]
DT1.columns = ["Year", "NF", "RF", "NTE", "RTE"]
DT1.index = pd.date_range(start = "1963-01-01", periods = len(DT1.index), freq = "A").to_period()
X = DT1["NTE"]
Q = DT1["RF"]
P = DT1["NF"] / DT1["RF"]
G = DT1["NTE"]/ DT1["RTE"]
DT1 = pd.DataFrame([X, Q, P, G], index = ["X", "Q", "P", "G"]).T
plt.figure()
DT1.plot(subplots = True, layout = (2, 2))
plt.tight_layout()
plt.show()
Let's specify the cointegration (equilibrium) equation:
ci = smf.ols(formula = 'np.log(Q) ~ 1 + np.log(X/G)', data = DT1).fit()
print(np.round(pd.DataFrame([ci.params, ci.tvalues, ci.pvalues], index = ["coef", "t-value", "p-value"]).T, 4))
ci_res = ci.resid
We use the residuals of the cointegration equation ci and create a regression model:
$$\Delta \text{log(Q)}_t = \lambda \cdot \text{ci_res}_{t-1} + \gamma_1 \Delta \text{log(Q)}_{t-1} + \omega_0 \Delta \text{log(X/G)}_{t} + \omega_1 \Delta \text{log(X/G)}_{t-1} + \epsilon_t$$
#Add the residuals to the data.frame:
DT1_res = pd.concat([DT1.reset_index(drop = True), ci_res.reset_index(drop = True)], axis = 1)
DT1_res.columns = ['X', 'Q', 'P', 'G', 'ci_res']
#Create the cointegration model:
model_1 = smf.ols(formula = 'd(np.log(Q)) ~ -1 + L(ci_res) + L(d(np.log(Q)), 1) + d(np.log(X/G)) + L(d(np.log(X/G)), 1)', data = DT1_res).fit()
print(np.round(pd.DataFrame([model_1.params, model_1.tvalues, model_1.pvalues], index = ["coef", "t-value", "p-value"]).T, 4))
The error correction term has the right sign and is significant (the coefficient of L(ci_res) has a p-value = 0.0018 < 0.05); the model itself is quite acceptable.
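Recalling the earlier interpretation of $\lambda$, we can (informally) gauge the implied speed of adjustment back to equilibrium from the estimated error correction coefficient:
#Approximate number of periods needed to compensate a unit deviation from equilibrium:
lambda_hat = model_1.params['L(ci_res)']
print("lambda estimate: ", np.round(lambda_hat, 4))
print("Approximate adjustment time (in periods): ", np.round(1 / abs(lambda_hat), 2))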
We can still improve the model by including other differences of $I(1)$ variables, e.g. log-differences of relative food prices $P/G$: $$\Delta \text{log(Q)}_t = \lambda \cdot \text{ci_res}_{t-1} + \gamma_1 \Delta \text{log(Q)}_{t-1} + \omega_0 \Delta \text{log(X/G)}_{t} + \omega_1 \Delta \text{log(X/G)}_{t-1} + \widetilde{\omega}_0 \Delta \text{log(P/G)}_{t} + \widetilde{\omega}_1 \Delta \text{log(P/G)}_{t-1} + \epsilon_t$$
model_2 = smf.ols(formula = 'd(np.log(Q)) ~ -1 + L(ci_res) + L(d(np.log(Q)), 1) + d(np.log(X/G)) + L(d(np.log(X/G)), 1) + d(np.log(P/G)) + L(d(np.log(P/G)), 1)', data = DT1_res).fit()
print(np.round(pd.DataFrame([model_2.params, model_2.tvalues, model_2.pvalues], index = ["coef", "t-value", "p-value"]).T, 4))
After deleting insignificant terms (here we use an arbitrary 10% significance level instead of 5%), we get the following ECM:
model_2 = smf.ols(formula = 'd(np.log(Q)) ~ -1 + L(ci_res) + L(d(np.log(Q)), 1) + d(np.log(X/G)) + L(d(np.log(X/G)), 1) + d(np.log(P/G))', data = DT1_res).fit()
print(np.round(pd.DataFrame([model_2.params, model_2.tvalues, model_2.pvalues], index = ["coef", "t-value", "p-value"]).T, 4), "\n-------\n")
model_2 = smf.ols(formula = 'd(np.log(Q)) ~ -1 + L(ci_res) + L(d(np.log(Q)), 1) + d(np.log(X/G)) + d(np.log(P/G))', data = DT1_res).fit()
print(np.round(pd.DataFrame([model_2.params, model_2.tvalues, model_2.pvalues], index = ["coef", "t-value", "p-value"]).T, 4), "\n-------\n")
Here the short-run demand elasticities with respect to total real expenditure and relative price are 0.3658 and -0.3236, respectively.
In the long run, however, demand is independent of the relative price, while the expenditure elasticity falls slightly to 0.3554 in the equilibrium relationship (see the model ci above).
We also note that the residuals are WN:
fig = plt.figure(figsize = (10, 5))
model_2.resid.plot(ax = fig.add_subplot(211))
sm.graphics.tsa.plot_acf(model_2.resid, lags = 20, zero = False, ax = fig.add_subplot(223))
sm.graphics.tsa.plot_pacf(model_2.resid, lags = 20, zero = False, ax = fig.add_subplot(224))
plt.tight_layout()
plt.show()
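To complement the visual inspection, we can also run a Ljung-Box test on the residuals (using sm.stats.acorr_ljungbox from statsmodels; note that the exact return format differs between statsmodels versions):
#Ljung-Box test for residual autocorrelation (H0: no autocorrelation up to lag 10):
print(sm.stats.acorr_ljungbox(model_2.resid, lags = [10]))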