Regression with Time Lags: Autoregressive Distributed Lag (ADL) Models

We shall re-do the example from the lecture slides.

We have data collected on a monthly basis over five years (i.e. 60 months) on the following variables:

  • $Y$ - market capitalization of Company B (in '000s of dollars);
  • $X$ - the price of oil (dollars per barrel) above a benchmark (or threshold) price.
In [60]:
#Import the required modules for data generation
import numpy as np

#Import the required modules for plot creation:
import matplotlib.pyplot as plt

#Import the required modules for DataFrame creation:
import pandas as pd

#Import the required modules for TimeSeries data generation:
import statsmodels.api as sm
#Import the required modules for test statistic calculation:
import statsmodels.stats as sm_stat
#Import the required modules for model estimation:
import statsmodels.tsa as smt
#Import the required modules for optimization:
import scipy.optimize as optimize

#We also need additional data:
import statsmodels.formula.api as smf
In [2]:
txt1 = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/badnews.xls"
#Read the data from the url:
BADNEWS = pd.read_excel(txt1)
#Add a time index:
BADNEWS.index = pd.date_range(start = "1990-01", periods = len(BADNEWS.index), freq = "M").to_period()
#Example of different date formats:
print(BADNEWS.index[:4])
BADNEWS.index = BADNEWS.index.to_timestamp()
print(BADNEWS.index[:4])
BADNEWS.columns = ["capital", "price"]
PeriodIndex(['1990-01', '1990-02', '1990-03', '1990-04'], dtype='period[M]', freq='M')
DatetimeIndex(['1990-01-01', '1990-02-01', '1990-03-01', '1990-04-01'], dtype='datetime64[ns]', freq='MS')
In [3]:
BADNEWS.head()
Out[3]:
capital price
1990-01-01 75401.92 5.4301
1990-02-01 70819.91 0.0000
1990-03-01 81087.10 16.3872
1990-04-01 79011.16 0.0000
1990-05-01 76432.67 0.0000
In [4]:
fig = plt.figure()
BADNEWS.plot(subplots=True, layout=(2,1))
plt.legend()
plt.tight_layout()
plt.show()

Since this is time series data, it is likely that previous month's news about oil price (i.e. the price of oil at time $t-1$) will affect the current market capitalization. As such, it is necessary to include lags of $X$ in the regression.

Below we present OLS estimates of the coefficients of a distributed lag model in which market capitalization is allowed to depend on present news about the oil price and news up to $q_{max} = 4$ months ago. That is: $$\text{capit}_{t} = \alpha + \beta_0 \cdot \text{price}_t + \beta_1 \cdot \text{price}_{t-1} + ... + \beta_4 \cdot \text{price}_{t-4} + \epsilon_t,\quad \epsilon_t \sim WN(0, \sigma^2)$$

In [5]:
def lag(x, n):
    #Return the series lagged (shifted) by n periods; n = 0 returns the series unchanged.
    if n == 0:
        return x
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    return x.shift(n)
In [6]:
mod_L4_est = smf.ols(formula = 'capital ~ 1 + lag(price, 0) + lag(price, 1) + lag(price, 2) + lag(price, 3) + lag(price, 4)', data = BADNEWS)
mod_L4_fit = mod_L4_est.fit()
print(mod_L4_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                capital   R-squared:                       0.760
Model:                            OLS   Adj. R-squared:                  0.736
Method:                 Least Squares   F-statistic:                     31.64
Date:                Mon, 09 Apr 2018   Prob (F-statistic):           2.22e-14
Time:                        13:19:25   Log-Likelihood:                -557.36
No. Observations:                  56   AIC:                             1127.
Df Residuals:                      50   BIC:                             1139.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      9.117e+04   1949.850     46.759      0.000    8.73e+04    9.51e+04
lag(price, 0)  -131.9943     47.436     -2.783      0.008    -227.272     -36.716
lag(price, 1)  -449.8597     47.557     -9.459      0.000    -545.380    -354.339
lag(price, 2)  -422.5183     46.778     -9.032      0.000    -516.474    -328.562
lag(price, 3)  -187.1041     47.641     -3.927      0.000    -282.794     -91.415
lag(price, 4)   -27.7710     47.662     -0.583      0.563    -123.503      67.961
==============================================================================
Omnibus:                        6.115   Durbin-Watson:                   2.241
Prob(Omnibus):                  0.047   Jarque-Bera (JB):                5.454
Skew:                          -0.755   Prob(JB):                       0.0654
Kurtosis:                       3.238   Cond. No.                         90.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Since the p-value corresponding to the explanatory variable $\text{price}_{t-4}$ is greater than 0.05, we cannot reject the null hypothesis that $\beta_4 = 0$ at the 5% level of significance. Accordingly, we drop this variable from the model and re-estimate the model with lag length equal to 3, yielding the following results:

In [7]:
mod_L3_est = smf.ols(formula = 'capital ~ 1 + lag(price, 0) + lag(price, 1) + lag(price, 2) + lag(price, 3)', data = BADNEWS)
mod_L3_fit = mod_L3_est.fit()
print(mod_L3_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                capital   R-squared:                       0.757
Model:                            OLS   Adj. R-squared:                  0.739
Method:                 Least Squares   F-statistic:                     40.60
Date:                Mon, 09 Apr 2018   Prob (F-statistic):           2.09e-15
Time:                        13:19:25   Log-Likelihood:                -567.20
No. Observations:                  57   AIC:                             1144.
Df Residuals:                      52   BIC:                             1155.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       9.04e+04   1643.183     55.017      0.000    8.71e+04    9.37e+04
lag(price, 0)  -125.9000     46.240     -2.723      0.009    -218.688     -33.112
lag(price, 1)  -443.4918     45.882     -9.666      0.000    -535.560    -351.424
lag(price, 2)  -417.6089     45.733     -9.131      0.000    -509.379    -325.838
lag(price, 3)  -179.9043     46.252     -3.890      0.000    -272.716     -87.093
==============================================================================
Omnibus:                        5.268   Durbin-Watson:                   2.235
Prob(Omnibus):                  0.072   Jarque-Bera (JB):                4.525
Skew:                          -0.679   Prob(JB):                        0.104
Kurtosis:                       3.244   Cond. No.                         71.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [8]:
print(np.round(mod_L3_fit.params, 4))
print(np.round(mod_L3_fit.pvalues, 4))
Intercept        90402.2210
lag(price, 0)     -125.9000
lag(price, 1)     -443.4918
lag(price, 2)     -417.6089
lag(price, 3)     -179.9043
dtype: float64
Intercept        0.0000
lag(price, 0)    0.0088
lag(price, 1)    0.0000
lag(price, 2)    0.0000
lag(price, 3)    0.0003
dtype: float64

The p-value for testing $\beta_3 = 0$ is 0.0003 < 0.05. We therefore conclude that the variable $\text{price}_{t-3}$ does indeed belong in the distributed lag model. Hence, $q = 3$ is the lag length we select for this model.

In a formal report, we would present this table of results, or the equation: $$\text{capit}_{t} = 90402.2210 - 125.9000 \cdot \text{price}_t - 443.4918 \cdot \text{price}_{t-1} - 417.6089 \cdot \text{price}_{t-2} - 179.9043 \cdot \text{price}_{t-3}$$

The interpretation is similar to the one presented in the lecture slides - increasing the oil price by 1 dollar per barrel in a given month $t$ is associated with:

  • An immediate reduction in market capitalization of \$ 125 900, ceteris paribus;
  • A further reduction in market capitalization of \$ 443 491 one month later, ceteris paribus;

and analogously for the lag-2 and lag-3 coefficients (see the sketch below).
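A minimal sketch of how these dollar figures follow from the estimated coefficients (reusing the fitted mod_L3_fit from above; recall that capital is measured in thousands of dollars):

#Dollar impact of a 1 dollar/barrel oil price increase, by lag
#(coefficients are in thousands of dollars, hence the factor of 1000):
impact = mod_L3_fit.params.drop("Intercept") * 1000
print(np.round(impact, 0))
#Cumulative effect over months t, t+1, t+2, t+3:
print("Total effect:", np.round(impact.sum(), 0))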



Short-run and Long-run Multipliers

In general, consider the following model: $$Y_t = \alpha + \phi Y_{t-1} + \beta_0 X_t + \beta_1 X_{t-1} + \epsilon_t,\quad 0 < \phi < 1$$

Then:

  • The short-run multiplier can be calculated by taking the partial derivatives: $$\dfrac{\partial Y_t}{\partial X_t} = \beta_0$$ Which shows that an increase in $X$ of one unit has an immediate impact on $Y$ of $\beta_0$ units.

  • The long-run multiplier (or equilibrium multiplier) is calculated as follows:
    In order to calculate the long-run multiplier, we need to see how a one-unit increase in $X$ at time $t$ affects $Y$ as $t$ increases.

    • The effect after one period is: $$\dfrac{\partial Y_{t+1}}{\partial X_t} = \phi \dfrac{\partial Y_t}{\partial X_t} + \beta_1 = \phi \beta_0 + \beta_1$$
    • similarly, after two periods the effect is: $$\dfrac{\partial Y_{t+2}}{\partial X_t} = \phi \dfrac{\partial Y_{t+1}}{\partial X_t} = \phi (\phi \beta_0 + \beta_1)$$
    • etc.

This shows that after the first period, the effect is decreasing if $|\phi| < 1$.

Imposing the so-called stability condition that $|\phi|<1$ allows us to determine the long-run effect of a unit change in $X_t$: $$\beta_0 + (\phi \beta_0 + \beta_1) + \phi (\phi \beta_0 + \beta_1) + ... = \beta_0 + (1 + \phi + \phi^2 + ...)(\phi \beta_0 + \beta_1) = \dfrac{\beta_0 + \beta_1}{1 - \phi}$$

i.e. it is the sum of all the partial derivatives of $Y_{t+k}$ as $k \rightarrow \infty$.

This says that if $X_t$ increases by one unit for one moment, then the expected cumulative change (increase or decrease) in $Y_t$ is given by $(\beta_0 + \beta_1)/(1 - \phi)$.

Note that we can calculate the symbolic expressions of the partial derivatives in Python, similarly to how it is done in R. More information can be found here. We present the first couple of partial derivative examples:

In [9]:
from sympy import *

Y_lag_1, X, X_lag_1, X_t_plus_1, X_t_plus_2 = symbols('Y_lag_1 X X_lag_1 X_t_plus_1 X_t_plus_2')
phi, beta_0, beta_1 = symbols('phi beta_0 beta_1')
In [10]:
#Define Y_t expression at time t:
expr_t0 = phi * Y_lag_1 + beta_0 * X + beta_1 * X_lag_1
expr_t1 = phi * expr_t0 + beta_0 * X_t_plus_1 + beta_1 * X
expr_t2 = phi * expr_t1 + beta_0 * X_t_plus_2 + beta_1 * X_t_plus_1

Note that if we print each expression, we see that the nested expressions in expr_t1 and expr_t2 are substituted automatically in Python:

In [11]:
print("Y_{t}:   ", expr_t0)
print("Y_{t+1}: ", expr_t1)
print("Y_{t+2}: ", expr_t2)
Y_{t}:    X*beta_0 + X_lag_1*beta_1 + Y_lag_1*phi
Y_{t+1}:  X*beta_1 + X_t_plus_1*beta_0 + phi*(X*beta_0 + X_lag_1*beta_1 + Y_lag_1*phi)
Y_{t+2}:  X_t_plus_1*beta_1 + X_t_plus_2*beta_0 + phi*(X*beta_1 + X_t_plus_1*beta_0 + phi*(X*beta_0 + X_lag_1*beta_1 + Y_lag_1*phi))

Now we can calculate the partial derivatives:

In [12]:
#Partial derivative of Y_t at time t with respect to X_t:
pd_1 = diff(expr_t0, X)
print(pd_1)
#Partial derivative at time t+1
pd_2 = diff(expr_t1, X)
print(pd_2)
#Partial derivative at time t+2:
pd_3 = diff(expr_t2, X)
print(pd_3)
beta_0
beta_0*phi + beta_1
phi*(beta_0*phi + beta_1)

which is exactly what we got by calculating the derivatives manually.

We can also evaluate an expression for specific values (more examples here) with 2 digit precision:

In [13]:
print("1st partial deriv. when beta_0 = 1 is: ", pd_1.evalf(2, subs = {beta_0: 1}))
print("2nd partial deriv. when beta_0 = 1, beta_1 = 0.5 and phi = -0.1 is: ", pd_2.evalf(2, subs = {beta_0: 1.0, beta_1: 0.5, phi: -0.1}))
1st partial deriv. when beta_0 = 1 is:  1.0
2nd partial deriv. when beta_0 = 1, beta_1 = 0.5 and phi = -0.1 is:  0.40
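We can also obtain the long-run multiplier symbolically. A minimal sketch under the stability condition $|\phi| < 1$ (so that the geometric series $1 + \phi + \phi^2 + ...$ sums to $1/(1-\phi)$), reusing the symbols defined above:

#Long-run (cumulative) effect of a one-unit change in X_t:
#beta_0 + (1 + phi + phi^2 + ...) * (phi*beta_0 + beta_1), with the series replaced by 1/(1 - phi):
long_run = beta_0 + (phi * beta_0 + beta_1) / (1 - phi)
#This simplifies to (beta_0 + beta_1)/(1 - phi):
print(simplify(long_run))
#Numerical value for beta_0 = 1, beta_1 = 0.5 and phi = -0.1:
print(long_run.evalf(4, subs = {beta_0: 1.0, beta_1: 0.5, phi: -0.1}))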



Time Series Regression when Both $Y$ and $X$ are Trend Stationary: Spurious Regression

The following data contains yearly observations on the Puerto Rican employment rate, the minimum wage and other variables.

In [14]:
txt1 = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/prmin.data.xls"
#Read the data from the url:
puerto = pd.read_excel(txt1, header = None)
puerto.columns = ["year","avgmin","avgwage","kaitz","avgcov","covt","mfgwage", "prdef",
                  "prepop","prepopf","prgnp","prunemp","usgnp","tt","post74", "lprunemp",
                  "lprgnp","lusgnp","lkaitz","lprun_1","lprepop","lprep_1",
                  "mincov","lmincov","lavgmin"]
#Add a time index:
puerto.index = pd.date_range(start = "1950-01-01", periods = len(puerto.index), freq = "A").to_period()
#Example of different date formats:
print(puerto.index[:4])
PeriodIndex(['1950', '1951', '1952', '1953'], dtype='period[A-DEC]', freq='A-DEC')
In [15]:
puerto.head()
Out[15]:
year avgmin avgwage kaitz avgcov covt mfgwage prdef prepop prepopf ... lprunemp lprgnp lusgnp lkaitz lprun_1 lprepop lprep_1 mincov lmincov lavgmin
1950 1950 0.198 0.398 0.155 0.201 0.29 0.43 0.859 0.470 0.470 ... 2.734367 6.778443 7.093155 -1.864330 . -0.755023 . 0.099995 -2.302635 -1.619488
1951 1951 0.209 0.410 0.164 0.207 0.29 0.45 0.881 0.449 0.449 ... 2.772589 6.829794 7.191580 -1.807889 -1.8708 -0.800732 -0.755023 0.105520 -2.248859 -1.565421
1952 1952 0.225 0.421 0.180 0.226 0.29 0.48 0.953 0.434 0.434 ... 2.694627 6.923530 7.229839 -1.714798 -1.83258 -0.834711 -0.800732 0.120784 -2.113753 -1.491655
1953 1953 0.311 0.480 0.229 0.231 0.29 0.50 0.970 0.428 0.428 ... 2.674149 6.985919 7.269129 -1.474033 -1.91054 -0.848632 -0.834711 0.149669 -1.899331 -1.167962
1954 1954 0.313 0.508 0.211 0.224 0.29 0.52 1.000 0.415 0.415 ... 2.727853 7.007058 7.255733 -1.555897 -1.93102 -0.879477 -0.848632 0.138016 -1.980387 -1.161552

5 rows × 25 columns

We will estimate the following model: $$\text{lprepop}_t = \beta_0 + \beta_1 \cdot \text{lmincov}_t + \beta_2 \cdot \text{lusgnp}_t + \epsilon_t, \quad \epsilon_t \sim WN(0, \sigma^2)$$ where:

  • $\text{lprepop} = \log \text{(employed_pop / popul_total)}$ - log of employment rate in Puerto Rico;
    • $employed\text{_}pop$ - Puerto Rico employed population;
    • $popul\text{_}total$ - Puerto Rico total working-age population
  • $\text{lmincov} = \log\left[(\text{avgmin} / \text{avgwage}) \cdot \text{avgcov}\right]$ - the importance of the minimum wage relative to average wages (a reconstruction check is sketched after this list);
    • $avgmin$ - the average minimum wage;
    • $avgwage$ - the average overall wage;
    • $avgcov$ - the average coverage rate (the proportion of workers actually covered by the minimum wage law);
  • $\text{lusgnp} = \log \text{(USGNP)}$;
    • $USGNP$ - real US Gross National Product in billions of dollars.
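As a quick sanity check, we can rebuild two of the logged variables from the raw columns and compare them with the ones already stored in the dataset (a small sketch; the stored values may differ slightly due to rounding):

#Reconstruct lmincov = log((avgmin/avgwage)*avgcov) and lusgnp = log(usgnp) manually:
check = pd.DataFrame({
    "lmincov_manual": np.log(puerto["avgmin"] / puerto["avgwage"] * puerto["avgcov"]),
    "lmincov_stored": puerto["lmincov"],
    "lusgnp_manual":  np.log(puerto["usgnp"]),
    "lusgnp_stored":  puerto["lusgnp"],
})
print(check.head())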
In [16]:
fig = plt.figure()
puerto[['lusgnp', 'lmincov', 'lprepop']].plot(subplots=True, layout=(3,1))
plt.legend()
plt.tight_layout()
plt.show()

All the variables have a trend close to linear. Note that an accurate analysis would require testing for a unit root first!

In [17]:
puerto_M1_est = smf.ols(formula = 'lprepop ~ 1 + lmincov + lusgnp', data = puerto)
puerto_M1_fit = puerto_M1_est.fit()
#Extract the estimated coefficient and p-value:
print(pd.DataFrame([puerto_M1_fit.params, puerto_M1_fit.pvalues], index = ["coef", "p-value"]).T)
               coef   p-value
Intercept -1.054414  0.177077
lmincov   -0.154443  0.022911
lusgnp    -0.012190  0.891254

Thus, if the minimum wage increases, then employment declines, which matches classical economics.

On the other hand, the GNP seems to be insignificant, but we shall test this claim by also including the trend tt:

In [18]:
puerto_M2_est = smf.ols(formula = 'lprepop ~ 1 + lmincov + lusgnp + tt', data = puerto)
puerto_M2_est = puerto_M2_est.fit()
print(pd.DataFrame([puerto_M2_est.params, puerto_M2_est.pvalues], index = ["coef", "p-value"]).T)
               coef       p-value
Intercept -8.696292  1.042422e-07
lmincov   -0.168695  5.521311e-04
lusgnp     1.057350  8.980642e-07
tt        -0.032354  2.313442e-07

Here, we interpret the coefficient of lusgnp as follows: if usgnp rises by 1% above its long-run trend, then prepop will rise an extra 1.057% above its long-run trend.

Note that the above regression is equivalent to this (rewritten initial equation in the form of deviations from the trend):

In [19]:
#Get the residuals of the DE-TRENDED series:
res_1 = smf.ols(formula = 'lprepop ~ 1 + tt', data = puerto).fit().resid
res_2 = smf.ols(formula = 'lmincov ~ 1 + tt', data = puerto).fit().resid
res_3 = smf.ols(formula = 'lusgnp ~ 1 + tt', data = puerto).fit().resid
res_data = pd.DataFrame([res_1, res_2, res_3], index = ["lprepop_detrend", "lmincov_detrend", "lusgnp_detrend"]).T
print("\n", res_data.head(), "\n\n")

puerto_detrended = smf.ols(formula = 'lprepop_detrend ~ 1 + lmincov_detrend + lusgnp_detrend', data = res_data).fit()
print(pd.DataFrame([puerto_detrended.params, puerto_detrended.pvalues], index = ["coef", "p-value"]).T)
       lprepop_detrend  lmincov_detrend  lusgnp_detrend
1950         0.062096        -0.249825       -0.061742
1951         0.023248        -0.235469        0.006285
1952        -0.003868        -0.139783        0.014146
1953        -0.010926         0.035219        0.023037
1954        -0.034909        -0.085257       -0.020757 


                         coef       p-value
Intercept       -4.165505e-16  1.000000e+00
lmincov_detrend -1.686946e-01  4.561927e-04
lusgnp_detrend   1.057350e+00  6.181506e-07

Note that the detrended series are residuals from regressions that already included an intercept, so they have zero mean; consequently, the intercept in this regression is both tiny and insignificant, and we can remove it:

In [20]:
puerto_detrended = smf.ols(formula = 'lprepop_detrend ~ -1 + lmincov_detrend + lusgnp_detrend', data = res_data).fit()
results = pd.DataFrame([puerto_detrended.params, puerto_detrended.pvalues], index = ["coef", "p-value"]).T
print(np.round(results, 4))
                   coef  p-value
lmincov_detrend -0.1687   0.0004
lusgnp_detrend   1.0573   0.0000

We can see that accounting for the time trend changes both the significance and the sign of the lusgnp coefficient in our model.




Time Series Regression when $Y$ and $X$ Have Unit Roots: Spurious Regression

Let us analyze the daily IBM stock prices spanning May 17, 1961 to November 2, 1962 (369 days in all) and the daily closing prices of the German DAX index starting at the 130th day of 1991.

OPTION 1: LOAD THE DATASET FROM A PACKAGE IN R:

We can install rpy2 - a package to use R inside of Python. More information here. This allows us to load datasets from R in Python. Open the Anaconda Navigator -> Environments. Click the arrow on base (root) and select Open Terminal and type the following:

conda install -c r rpy2

conda install -c r r-urca

If prompted to proceed, type y to install or update the required packages.

Note that if we have more than one installation of R (some versions of Anaconda have a separate path to their own version of R), then we need to specify the location of the packages for the version of R that we want to use.

Note that the newest version of R packages are located in:

C:\Users\[Your PC Username Here]\Documents\R\win-library\

Then the folder with the highest version number is the package folder of the newest R version. If needed, an older folder can also be specified. In our case, the packages are located in:

In [21]:
from rpy2.robjects.packages import importr

waveslim = importr('waveslim', lib_loc = "C:/Users/Andrius/Documents/R/win-library/3.4")

Next, we need to load the ibm data as we would do in R with data(ibm):

In [22]:
utils = importr("utils")
utils.data("ibm")
Out[22]:
StrVector with 1 elements.
'ibm'

Finally, we can transform the R object into an array in Python:

In [23]:
from rpy2.robjects import r, pandas2ri

ibm_data = pandas2ri.ri2py(r["ibm"])
ibm_data = pd.Series(ibm_data)
print(ibm_data.head())
0    460.0
1    457.0
2    452.0
3    459.0
4    462.0
dtype: float64
In [24]:
dax_data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/EuStockMarkets.csv")
dax_data = dax_data["DAX"][:369]
dax_data = pd.Series(dax_data, name = "DAX Index Price")
dax_data.head()
Out[24]:
0    1628.75
1    1613.63
2    1606.51
3    1621.04
4    1618.16
Name: DAX Index Price, dtype: float64

OPTION 2: if the data does not load from R, or if you do not have R installed, use the following code to load the data:

In [25]:
ibm_data = [460., 457., 452., 459., 462., 459., 463., 479., 493., 490., 492.,
       498., 499., 497., 496., 490., 489., 478., 487., 491., 487., 482.,
       479., 478., 479., 477., 479., 475., 479., 476., 476., 478., 479.,
       477., 476., 475., 475., 473., 474., 474., 474., 465., 466., 467.,
       471., 471., 467., 473., 481., 488., 490., 489., 489., 485., 491.,
       492., 494., 499., 498., 500., 497., 494., 495., 500., 504., 513.,
       511., 514., 510., 509., 515., 519., 523., 519., 523., 531., 547.,
       551., 547., 541., 545., 549., 545., 549., 547., 543., 540., 539.,
       532., 517., 527., 540., 542., 538., 541., 541., 547., 553., 559.,
       557., 557., 560., 571., 571., 569., 575., 580., 584., 585., 590.,
       599., 603., 599., 596., 585., 587., 585., 581., 583., 592., 592.,
       596., 596., 595., 598., 598., 595., 595., 592., 588., 582., 576.,
       578., 589., 585., 580., 579., 584., 581., 581., 577., 577., 578.,
       580., 586., 583., 581., 576., 571., 575., 575., 573., 577., 582.,
       584., 579., 572., 577., 571., 560., 549., 556., 557., 563., 564.,
       567., 561., 559., 553., 553., 553., 547., 550., 544., 541., 532.,
       525., 542., 555., 558., 551., 551., 552., 553., 557., 557., 548.,
       547., 545., 545., 539., 539., 535., 537., 535., 536., 537., 543.,
       548., 546., 547., 548., 549., 553., 553., 552., 551., 550., 553.,
       554., 551., 551., 545., 547., 547., 537., 539., 538., 533., 525.,
       513., 510., 521., 521., 521., 523., 516., 511., 518., 517., 520.,
       519., 519., 519., 518., 513., 499., 485., 454., 462., 473., 482.,
       486., 475., 459., 451., 453., 446., 455., 452., 457., 449., 450.,
       435., 415., 398., 399., 361., 383., 393., 385., 360., 364., 365.,
       370., 374., 359., 335., 323., 306., 333., 330., 336., 328., 316.,
       320., 332., 320., 333., 344., 339., 350., 351., 350., 345., 350.,
       359., 375., 379., 376., 382., 370., 365., 367., 372., 373., 363.,
       371., 369., 376., 387., 387., 376., 385., 385., 380., 373., 382.,
       377., 376., 379., 386., 387., 386., 389., 394., 393., 409., 411.,
       409., 408., 393., 391., 388., 396., 387., 383., 388., 382., 384.,
       382., 383., 383., 388., 395., 392., 386., 383., 377., 364., 369.,
       355., 350., 353., 340., 350., 349., 358., 360., 360., 366., 359.,
       356., 355., 367., 357., 361., 355., 348., 343., 330., 340., 339.,
       331., 345., 352., 346., 352., 357.]

ibm_data = pd.Series(ibm_data, name = "IBM Stock Price")
In [26]:
stock_data = pd.DataFrame([ibm_data, dax_data], index = ["IBM", "DAX"]).T
plt.figure()
stock_data.plot(subplots=True, layout=(2,1))
plt.legend()
plt.show()

Let's create a static model by regressing the price of IBM stock on the price of the DAX index:

In [27]:
ID_est = smf.ols(formula = 'IBM ~ 1 + DAX', data = stock_data)
ID_fit = ID_est.fit()
print(np.round(pd.DataFrame([ID_fit.params, ID_fit.pvalues], index = ["coef", "p-value"]).T, 4))
              coef  p-value
Intercept  31.9930   0.6659
DAX         0.2735   0.0000

Although, by their nature, IBM and DAX should not be related in any way, the regression coefficient is highly significant. This is an example of a spurious regression, which can be explained by the fact that the errors of the model have a unit root.

Note: to have formula functionality similar to that of R's dynlm package, let's define all the functions that we will need:

In [28]:
#Lag operator:
def L(x, n = 1):
    if n == 0:
        return x
    if isinstance(x, pd.Series):
        return x.shift(n) 
    else:
        x = pd.Series(x)
        return x.shift(n)

#Difference operator:
def d(x):
    if isinstance(x, pd.Series):
        return x.diff() 
    else:
        x = pd.Series(x)
        return x.diff() 

#Trend/time operator:
def time(x):
    return list(range(1, len(x) + 1))

We can test the functions to verify that they work correctly:

In [29]:
test_series = pd.Series(list(range(3, 23)))
print("Series: ", test_series.tolist())
print("lagged series: ", L(test_series, 1).tolist())
print("diff. series: ", d(test_series).tolist())
print("trend: ", time(test_series))
Series:  [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
lagged series:  [nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0]
diff. series:  [nan, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
trend:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

The following should be carried out:

  1. Establish that IBM has a unit root;
  2. Establish that DAX has a unit root;
  3. Verify that the errors of the ID model have a unit root.

Steps (1) and (2) can be carried out in the same manner as in the previous lectures via the Augmented Dickey-Fuller unit root test, using the automated lag order selection procedure according to the AIC value.

Note that the possible values of the regression argument are {'c', 'ct', 'ctt', 'nc'} - the constant and trend order to include in the regression:

  • c : constant only (default);
  • ct : constant and trend;
  • ctt : constant, and linear and quadratic trend;
  • nc : no constant, no trend;
In [30]:
IBM_adf = smt.stattools.adfuller(stock_data["IBM"], maxlag = 5, autolag = 'AIC', regression = 'c', regresults = True)
IBM_adf
Out[30]:
(-0.34269996414579457,
 0.9192352438659351,
 {'1%': -3.448294490928673,
  '10%': -2.570982681065269,
  '5%': -2.869447722240253},
 <statsmodels.tsa.stattools.ResultsStore at 0x1ac35b52518>)

We see that:

- The 1st element is the test statistic of the ADF test;
- The 2nd element is the p-value associated with the test statistic;
- The 3rd element is a dictionary of the critical values for the test statistic at the 1 %, 5 %, and 10 % levels;
- The 4th element contains the additional regression results returned by the test.
In [31]:
#Get the regression results:
ibm_adf_mdl = IBM_adf[3]

We can look at the possible attributes which we can access in the variable:

In [32]:
dir(IBM_adf[3])
Out[32]:
['H0',
 'HA',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_str',
 'adfstat',
 'autolag_results',
 'critvalues',
 'icbest',
 'maxlag',
 'nobs',
 'resols',
 'usedlag']

We can print the hypothesis of the test:

In [33]:
print("H0: ", ibm_adf_mdl.H0)
print("H1: ", ibm_adf_mdl.HA)
H0:  The coefficient on the lagged level equals 1 - unit root
H1:  The coefficient on the lagged level < 1 - stationary

Note the use of the word level, which indicates that the coefficient is for the variable $Y_t$ and not $\Delta Y_t$.

In [34]:
print("Number of Delta Y_t lags in the ADF regression : ", ibm_adf_mdl.usedlag)
print("ADF test statistic: ", ibm_adf_mdl.adfstat)
print("ADF critical values: ", ibm_adf_mdl.critvalues)
Number of Delta Y_t lags in the ADF regression :  1
ADF test statistic:  -0.34269996414579457
ADF critical values:  {'1%': -3.448294490928673, '5%': -2.869447722240253, '10%': -2.570982681065269}

Because the ADF test-statistic is greater than the 5% critical value, we do not reject the null hypothesis that the time series has a unit root, meaning it is non-stationary (i.e. that it is DS).

The same can be said if we look at the p-value:

In [35]:
np.round(IBM_adf[1], 4)
Out[35]:
0.9192

Because the p-value > 0.05, we do not reject the null hypothesis, which means that the IBM time series is a DS process (or, equivalently, an integrated series).

We can also look at the regression carried out via OLS:

In [36]:
print(ibm_adf_mdl.resols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     1.404
Date:                Mon, 09 Apr 2018   Prob (F-statistic):              0.247
Time:                        13:19:34   Log-Likelihood:                -1246.7
No. Observations:                 367   AIC:                             2499.
Df Residuals:                     364   BIC:                             2511.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0015      0.005     -0.343      0.732      -0.010       0.007
x2             0.0871      0.052      1.662      0.097      -0.016       0.190
const          0.4949      2.198      0.225      0.822      -3.828       4.818
==============================================================================
Omnibus:                       48.240   Durbin-Watson:                   1.998
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              184.165
Skew:                          -0.496   Prob(JB):                     1.02e-40
Kurtosis:                       6.326   Cond. No.                     2.82e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.82e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Note that:

  • x1 is the $\rho$ coefficient used for testing the unit root null hypothesis, i.e. $H_0: \rho = 0$, or, equivalently (using the definition of $\rho$), $H_0: \phi_1 + ... + \phi_{p+1} = 1$; its $\text{t-statistic} = -0.343$ is exactly the ADF test statistic reported above;
  • x2 - because ibm_adf_mdl.usedlag is equal to 1, x2 is the $\gamma_1$ coefficient for $\Delta Y_{t-1}$.
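We can check this correspondence directly from the attributes we already extracted (a small sketch):

#The t-statistic of x1 in the auxiliary ADF regression coincides with the ADF test statistic:
print("t-statistic of x1: ", ibm_adf_mdl.resols.tvalues[0])
print("ADF test statistic:", ibm_adf_mdl.adfstat)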

So, we confirmed that IBM has a unit root. We can do the same for DAX (here we present only the relevant results):

In [37]:
DAX_adf = smt.stattools.adfuller(stock_data["DAX"], maxlag = 5, autolag = 'AIC', regression = 'c', regresults = True)
print("ADF test statistic", DAX_adf[0])
print("ADF test p-value: ", DAX_adf[1])
print("ADF critical values: ", DAX_adf[2])
ADF test statistic -1.4490774798904043
ADF test p-value:  0.5585349781590825
ADF critical values:  {'1%': -3.4482453822848496, '5%': -2.8694261442901396, '10%': -2.5709711770439507}

Since the p-value is greater than 0.05, we do not reject the null hypothesis of a unit root.

Note, that the hypothesis and the selected lag order can be extracted as before:

In [38]:
print("H0: ", DAX_adf[3].H0)
print("H1: ", DAX_adf[3].HA)
print("Lag order : ", DAX_adf[3].usedlag)
H0:  The coefficient on the lagged level equals 1 - unit root
H1:  The coefficient on the lagged level < 1 - stationary
Lag order :  0

Regarding step (3):

In [39]:
ID_res = pd.DataFrame([ID_fit.resid], index = ["ID_res"]).T
ID_res_est = smf.ols(formula = 'd(ID_res) ~ 1 + L(ID_res, 1) + time(ID_res)', data = ID_res)
ID_res_fit = ID_res_est.fit()
print(np.round(pd.DataFrame([ID_res_fit.params, ID_res_fit.pvalues], index = ["coef", "p-value"]).T, 4))
                coef  p-value
Intercept     1.8987   0.0820
L(ID_res, 1) -0.0134   0.0604
time(ID_res) -0.0113   0.0358

Remember that we previously defined the d(), L() and time() functions, which lets us use them in the formula expression.

Recall that the unit root hypothesis $H_0$ claims that the coefficient of $\text{L(ID_res, 1)}$ equals zero. In this case of testing for a unit root in the errors, we face a further complication: here the t-statistic has not the Dickey-Fuller, but the Engle-Granger distribution, whose critical values are given in this table:

                              Significance level
No. of variables (included Y)    1%        5%       10%
2                              -3.90     -3.34     -3.04
3                              -4.29     -3.74     -3.45
4                              -4.64     -4.10     -3.81
5                              -4.96     -4.42     -4.13

In order to test the cointegration of these two variables, note that the t-statistic equals $-1.884$, which is closer to $0$ than $-3.34$; therefore there are no grounds to reject $H_0$.
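The t-statistic quoted above can be extracted directly from the fitted model (a small check; the relevant value is the one for L(ID_res, 1)):

#t-statistics of the residual regression; the L(ID_res, 1) entry is the
#Engle-Granger test statistic referred to in the text:
print(np.round(ID_res_fit.tvalues, 4))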

Thus, the errors have a unit root or, in other words, $IBM$ and $DAX$ are not cointegrated and the strong regression bonds are only spurious.




Time Series Regression when $Y$ and $X$ Have Unit Roots: Cointegration

It is well known that many economic time series are $DS$ and therefore the regression for levels is often misleading (the standard Student or Wald tests provide wrong p-values). On the other hand, if these $DS$ series are cointegrated, the OLS method is OK.

Recall that if $Y$ and $X$ have unit roots (i.e. are non-stationary), but some linear combination $Y_t - \alpha - \beta X_t$ is (Trend) Stationary, then we say that $Y$ and $X$ are cointegrated.

In order to establish cointegration, we estimate the unknown coefficients $\alpha$ and $\beta$ via OLS and then test whether the errors of the model have a unit root: $$ \begin{aligned} Y_t &= \alpha + \beta X_t\ (+ \delta t) + \epsilon_t \\ &\Downarrow_{OLS}\\ \widehat{e}_t &= Y_t - \widehat{\alpha} - \widehat{\beta} X_t\ (- \widehat{\delta} t ) \end{aligned} $$

The example dataset contains quarterly data 1947:Q1 - 1989:Q3, where:

  • lc - logarithms of the real quarterly personal expenditure (i.e. consumption);
  • ly - logarithms of the real quarterly aggregated disposable income.

Note: to use different frequencies see here.

In [40]:
url = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/hamilton.txt"
ham = pd.read_csv(url, sep = " ")
ham.index = pd.date_range(start = ham["quarter"][0], periods = len(ham.index), freq = "Q").to_period()
ham = ham[["lc", "ly"]]

plt.figure()
ham.plot(subplots = False)
plt.legend()
plt.show()

It seems that both time series may have a trend; as such, our cointegration equation will include a time trend as well, i.e.: $$Y_t = \alpha + \beta X_t + \delta t + \epsilon_t$$

In [41]:
coint_eq = smf.ols(formula = 'lc ~ 1 + ly + time(ly)', data = ham).fit()
print(np.round(pd.DataFrame([coint_eq.params, coint_eq.pvalues], index = ["coef", "p-value"]).T, 4))
               coef  p-value
Intercept  217.4099      0.0
ly           0.6557      0.0
time(ly)     0.2852      0.0
In [42]:
plt.figure(figsize = (10, 5))
plt.scatter(ham["ly"], ham["lc"])
plt.plot(ham["ly"], coint_eq.fittedvalues, color = "red")
plt.xlabel("ly")
plt.ylabel("lc")
plt.show()
In [43]:
plt.figure(figsize = (10, 5))
coint_eq.resid.plot()
plt.hlines(y = 0, xmin = coint_eq.resid.index[0], xmax = coint_eq.resid.index[-1], color = "red")
plt.show()

It is easy to verify, via Procedure C from the previous lecture, that both time series have unit roots (note that we include the trend component as well):

In [44]:
mod_ly = smt.stattools.adfuller(ham["ly"], maxlag = 5, autolag = 'AIC', regression = 'ct', regresults = True)
print("Lag order selected: ", mod_ly[3].usedlag)
print("ADF test statistic", mod_ly[0])
print("ADF test p-value: ", mod_ly[1])
print("ADF critical values: ", mod_ly[2])
Lag order selected:  0
ADF test statistic -1.7697135404264883
ADF test p-value:  0.7191756082051783
ADF critical values:  {'1%': -4.013034503358437, '5%': -3.436637782210462, '10%': -3.142399780175046}
In [45]:
mod_lc = smt.stattools.adfuller(ham["lc"], maxlag = 5, autolag = 'AIC', regression = 'ct', regresults = True)
print("Lag order selected: ", mod_lc[3].usedlag)
print("ADF test statistic", mod_lc[0])
print("ADF test p-value: ", mod_lc[1])
print("ADF critical values: ", mod_lc[2])
Lag order selected:  2
ADF test statistic -2.391784347251651
ADF test p-value:  0.3839956844035152
ADF critical values:  {'1%': -4.013693020892385, '5%': -3.436953055673658, '10%': -3.14258426212747}

We see that in both cases the associated p-values are greater than 0.05, so we do not reject the null hypothesis of a unit root for either of the time series.

We can also do the same thing manually (remember Procedure A from the previous Lectures):

In [46]:
#Initial model for unit root test:
adf_lc = smf.ols(formula = 'd(lc) ~ 1 + L(lc, 1) + time(lc) + L(d(lc), 1) + L(d(lc), 2) + L(d(lc), 3) + L(d(lc), 4)', data = ham).fit()
print(np.round(pd.DataFrame([adf_lc.params, adf_lc.pvalues], index = ["coef", "p-value"]).T, 4), "\n\n -------------- \n")
#Now drop the last lag if it is insignificant:
adf_lc = smf.ols(formula = 'd(lc) ~ 1 + L(lc, 1) + time(lc) + L(d(lc), 1) + L(d(lc), 2) + L(d(lc), 3)', data = ham).fit()
print(np.round(pd.DataFrame([adf_lc.params, adf_lc.pvalues], index = ["coef", "p-value"]).T, 4), "\n\n -------------- \n")
#Drop the 3rd lag as it is insignificant:
adf_lc = smf.ols(formula = 'd(lc) ~ 1 + L(lc, 1) + time(lc) + L(d(lc), 1) + L(d(lc), 2)', data = ham).fit()
print(np.round(pd.DataFrame([adf_lc.params, adf_lc.pvalues], index = ["coef", "p-value"]).T, 4), "\n\n -------------- \n")
                coef  p-value
Intercept    33.7057   0.0225
L(lc, 1)     -0.0512   0.0254
time(lc)      0.0430   0.0272
L(d(lc), 1)   0.0476   0.5441
L(d(lc), 2)   0.2601   0.0010
L(d(lc), 3)   0.0551   0.4853
L(d(lc), 4)  -0.0555   0.4804 

 -------------- 

                coef  p-value
Intercept    35.7358   0.0133
L(lc, 1)     -0.0544   0.0150
time(lc)      0.0456   0.0163
L(d(lc), 1)   0.0455   0.5574
L(d(lc), 2)   0.2486   0.0013
L(d(lc), 3)   0.0576   0.4609 

 -------------- 

                coef  p-value
Intercept    34.2288   0.0159
L(lc, 1)     -0.0520   0.0179
time(lc)      0.0439   0.0187
L(d(lc), 1)   0.0655   0.3895
L(d(lc), 2)   0.2395   0.0019 

 -------------- 

We see that the largest lag $\Delta Y_{t-2}$ coefficient is significant, so our lag order is 2. The time coefficient time(lc) is also significant.

Looking at the t-statistic of L(lc, 1):

In [47]:
print(adf_lc.tvalues)
Intercept      2.437134
L(lc, 1)      -2.391784
time(lc)       2.375246
L(d(lc), 1)    0.862783
L(d(lc), 2)    3.157062
dtype: float64

The value is $-2.3918$, which is greater than the 5% critical value of roughly $-3.44$, so we do not reject the null hypothesis of a unit root. The same can be shown for the variable ly as well (a starting sketch is given below - verify this!).
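The same Procedure A steps can be repeated for ly. A sketch of the starting regression only - insignificant lagged differences should then be dropped one by one, exactly as was done for lc above:

#Initial unit root test regression for ly (Procedure A):
adf_ly = smf.ols(formula = 'd(ly) ~ 1 + L(ly, 1) + time(ly) + L(d(ly), 1) + L(d(ly), 2) + L(d(ly), 3) + L(d(ly), 4)', data = ham).fit()
print(np.round(pd.DataFrame([adf_ly.params, adf_ly.tvalues, adf_ly.pvalues], index = ["coef", "t-value", "p-value"]).T, 4))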

Now, we want to test whether the errors of the model: $$\text{lc}_t = \alpha + \beta \cdot \text{ly}_t + \gamma t + \epsilon_t$$

form a stationary process, i.e. do not have a unit root. We will try two different cointegration models - one with a trend and one without:

In [48]:
cy_res_no_trend = pd.DataFrame([smf.ols(formula = 'lc ~ 1 + ly', data = ham).fit().resid], index = ["res"]).T
cy_res_trend = pd.DataFrame([smf.ols(formula = 'lc ~ 1 + ly + time(lc)', data = ham).fit().resid], index = ["res"]).T

We want to test whether $|\phi| < 1$ from the model: $$\text{cy_res}_t = \phi \cdot \text{cy_res}_{t-1} + \epsilon_t$$

However, if the errors of this model do not constitute WN, the standard OLS procedure gives an erroneous (i.e. incorrect) estimate of $\phi$. Then:

  • The first approach is to replace the $AR(1)$ model by an $AR(p)$ model;
  • The second approach is to use the $AR(1)$ model, but to take into account the fact that the errors are not $WN$ - this can be done with the Phillips-Ouliaris test.

First Method

This method is also called the Engle-Granger method; it tests the hypothesis of a unit root in the errors. However, the OLS procedure chooses the coefficients of the cointegration equation so that the variance of the residuals is minimal, thus the residuals are "too stationary". In other words, the null hypothesis of a unit root will be rejected too often.

This is why we use other critical values to test the null hypothesis $H_0: \rho = 0$ in $\Delta e_t = \rho e_{t-1} + \gamma_1 \Delta e_{t-1} + ... + \gamma_{p-1} \Delta e_{t-p+1} + w_t$.

The asymptotic 5% critical values for residual unit root tests for cointegration in: $$Y_t = \alpha + \beta X_t + \epsilon_t$$

No. of X's in the           No deterministic     Only constant     Constant and
right-hand-side of eq.      terms in eq.         in eq.            Trend in eq.
1                               -2.76               -3.37             -3.80
2                               -3.27               -3.77             -4.16
3                               -3.74               -4.11             -4.49
4                               -4.12               -4.45             -4.74
5                               -4.40               -4.71             -5.03

We shall perform the test manually. The final model of the residuals is:

In [49]:
res_coint_no_trend_eq = smf.ols(formula = 'd(res) ~ -1 + L(res, 1) + L(d(res), 1) + L(d(res), 2) + L(d(res), 3) + L(d(res), 4)', data = cy_res_no_trend).fit()
print(np.round(pd.DataFrame([res_coint_no_trend_eq.params, res_coint_no_trend_eq.pvalues], index = ["coef", "p-value"]).T, 4), "\n-----\n")
print(res_coint_no_trend_eq.tvalues)
                coef  p-value
L(res, 1)    -0.1450   0.0066
L(d(res), 1) -0.2725   0.0008
L(d(res), 2) -0.0806   0.3252
L(d(res), 3) -0.0378   0.6406
L(d(res), 4) -0.2296   0.0020 
-----

L(res, 1)      -2.751977
L(d(res), 1)   -3.414327
L(d(res), 2)   -0.986798
L(d(res), 3)   -0.467790
L(d(res), 4)   -3.137745
dtype: float64

In this case, since $-2.752$ is closer to zero than $-3.37$, we do not reject the null hypothesis of a unit root in the errors; thus the processes lc and ly are not cointegrated.

In [50]:
res_coint_trend_eq =smf.ols(formula = 'd(res) ~ -1 + L(res, 1) + L(d(res), 1)', data = cy_res_trend).fit()
print(np.round(pd.DataFrame([res_coint_trend_eq.params, res_coint_trend_eq.pvalues], index = ["coef", "p-value"]).T, 4), "\n-----\n")
print(res_coint_trend_eq.tvalues)
                coef  p-value
L(res, 1)    -0.2415   0.0000
L(d(res), 1) -0.2208   0.0024 
-----

L(res, 1)      -4.536843
L(d(res), 1)   -3.084486
dtype: float64

On the other hand, if we include a trend component, then our t-statistic -4.5368 < -3.80. In this case, we reject the null hypothesis, i.e. we would get that lc and ly are cointegrated!

This highlights one drawback of the Engle-Granger procedure - the test does not consistently indicate cointegration.

One possible explanation is that the Engle-Granger and Dickey-Fuller tests are known to have low power.
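Note that statsmodels also provides an automated version of this first (Engle-Granger) method, smt.stattools.coint, which runs the residual ADF test with Engle-Granger critical values. A sketch of such a cross-check (its lag selection is automatic, so the numbers need not match the manual calculation exactly):

#Engle-Granger cointegration test; H0: the series are NOT cointegrated.
#trend = 'c'  - only a constant in the cointegrating regression;
#trend = 'ct' - constant and linear trend.
for trend in ['c', 'ct']:
    t_stat, p_value, crit = smt.stattools.coint(ham["lc"], ham["ly"], trend = trend)
    print("trend =", trend,
          "| t-statistic:", np.round(t_stat, 4),
          "| p-value:", np.round(p_value, 4),
          "| 5% critical value:", np.round(crit[1], 4))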

Second Method

The Phillips-Ouliaris cointegration test: $$H_0: \text{x and y are not cointegrated}$$

Note that Python does not have an implementation of this test. This can be somewhat remedied by passing instructions to R from Python:

In [51]:
import rpy2.robjects as ro

#Append our actual library directory to the current one:
ro.r('.libPaths(c(.libPaths(),"C:/Users/Andrius/Documents/R/win-library/3.4"))')
print(ro.r('.libPaths()'))
[1] "C:/Users/Andrius/Anaconda3/Lib/R/library"    

[2] "C:/Users/Andrius/Documents/R/win-library/3.4"

In [52]:
ro.globalenv['lc'] = ro.FloatVector(ham["lc"])
ro.globalenv['ly'] = ro.FloatVector(ham["ly"])
ro.globalenv['cy_data'] = ro.r('cbind(lc,ly)')

ro.r('library(urca)')
ro.globalenv['cy_po'] = ro.r('ca.po(cy_data, demean = "trend", type = "Pz")')
print(ro.r('summary(cy_po)'))

######################################## 
# Phillips and Ouliaris Unit Root Test # 
######################################## 

Test of type Pz 
detrending of series with constant and linear trend 

Response lc :

Call:
lm(formula = lc ~ zr + trd)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.85014 -0.37853 -0.00368  0.50769  3.08677 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.85240   14.79603   2.288   0.0234 *  
zrlc         0.88800    0.04841  18.345   <2e-16 ***
zrly         0.06024    0.03552   1.696   0.0918 .  
trd          0.04291    0.01940   2.212   0.0283 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7874 on 166 degrees of freedom
Multiple R-squared:  0.9997,	Adjusted R-squared:  0.9996 
F-statistic: 1.582e+05 on 3 and 166 DF,  p-value: < 2.2e-16


Response ly :

Call:
lm(formula = ly ~ zr + trd)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4872 -0.4648  0.0132  0.5494  5.3991 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -28.47975   19.97038  -1.426 0.155717    
zrlc          0.25511    0.06533   3.905 0.000137 ***
zrly          0.79301    0.04795  16.539  < 2e-16 ***
trd          -0.03992    0.02619  -1.524 0.129349    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.063 on 166 degrees of freedom
Multiple R-squared:  0.9994,	Adjusted R-squared:  0.9994 
F-statistic: 8.897e+04 on 3 and 166 DF,  p-value: < 2.2e-16


Value of test-statistic is: 49.341 

Critical values of Pz are:
                  10pct    5pct     1pct
critical values 71.9586 81.3812 102.0167



The test type = "Pz", compared to the test type = "Pu", has the advantage that it is invariant to the normalization of the cointegration vector, i.e. it does not matter which variable is on the left hand side of the equation.

In [53]:
#More compactly - use `capture.output` and print the last 6 lines of the summary:
ro.globalenv['cy_po_summary'] = ro.r('capture.output(summary(cy_po))')
print(ro.r('cy_po_summary[(length(cy_po_summary) - 6):length(cy_po_summary)]'))
[1] ""                                        

[2] "Value of test-statistic is: 49.341 "     

[3] ""                                        

[4] "Critical values of Pz are:"              

[5] "                  10pct    5pct     1pct"

[6] "critical values 71.9586 81.3812 102.0167"

[7] ""                                        

The test statistic 49.341 is closer to 0 than 81.3812 (alternatively, 49.341 < 81.3812), therefore we do not reject the null hypothesis and conclude that lc and ly are not cointegrated.




Time Series Regression when $Y$ and $X$ are Cointegrated: The Error Correction Model (ECM)

If $X$ and $Y$ are integrated but not cointegrated, their ties (in levels) can only be spurious.

In order to get sensible results, the model may be augmented with new variables or another $ADL$ model (for stationary differences) may be created, like: $$\Delta Y_t = \phi + \gamma_1 \Delta Y_{t-1} + ... + \gamma_p \Delta Y_{t-p} + \omega_0 \Delta X_t + ... + \omega_q \Delta X_{t-q} + \epsilon_t$$

A similar model can also be analyzed in the case of cointegration, but then one more term, the so-called error correction term $\widehat{u}_{t-1}$ (where $Y_t = \alpha + \beta X_t + u_t$ is the long-run equilibrium, or cointegration, equation) should be included: $$\Delta Y_t = \phi + \delta t + \lambda \widehat{u}_{t-1} + \gamma_1 \Delta Y_{t-1} + ... + \gamma_p \Delta Y_{t-p} + \omega_0 \Delta X_t + ... + \omega_q \Delta X_{t-q} + \epsilon_t$$

This is the Error Correction Model (ECM), where the coefficient $\lambda$ must be negative. E.g., if $\lambda = - 0.25$, then a unit deviation from the equilibrium will be compensated in $1/0.25 = 4$ time units.

The ECM is usually created in two steps:

  1. Estimate the long-run equilibrium model: $Y_t = \alpha + \beta X_t + u_t$ and save the residuals;
  2. Estimate the regression model: $\Delta Y_t = \phi + \delta t + \lambda \widehat{u}_{t-1} + \gamma_1 \Delta Y_{t-1} + ... + \gamma_p \Delta Y_{t-p} + \omega_0 \Delta X_t + ... + \omega_q \Delta X_{t-q} + \epsilon_t$.

Let's say we have a dataset where:

  • $NF$ = expenditure on food in current prices;
  • $RF$ = expenditure on food in constant prices, i.e. the demand for food, $Q$;
  • $NTE$ = total expenditure in current prices, i.e. the total expenditure, $X$;
  • $RTE$ = the total expenditure in constant prices.

Suppose further that we found that $\ln Q$ and $\ln(X/G)$ are cointegrated (this should be tested! - a quick check is sketched below, after the data is loaded).

In [54]:
url = "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/data1.txt"
DT1 = pd.read_csv(url, sep = "\t")

#Fix an error when reading the file
if pd.isna(DT1.iloc[17,2]):
    DT1.iloc[17, 2] = DT1.iloc[17, 3]
    DT1.iloc[17, 3] = DT1.iloc[17, 4]
    DT1.iloc[17, 4] = DT1.iloc[17, 5]
    DT1.iloc[17, 5] = DT1.iloc[17, 6]

DT1 = DT1.iloc[:,0:5]
DT1.columns = ["Year", "NF", "RF", "NTE", "RTE"]
DT1.index = pd.date_range(start = "1963-01-01", periods = len(DT1.index), freq = "A").to_period()

X = DT1["NTE"]
Q = DT1["RF"]
P = DT1["NF"] / DT1["RF"]
G = DT1["NTE"]/ DT1["RTE"]

DT1 = pd.DataFrame([X, Q, P, G], index = ["X", "Q", "P", "G"]).T

plt.figure()
DT1.plot(subplots = True, layout = (2, 2))
plt.tight_layout()
plt.show()
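Before relying on the ECM, the claimed cointegration of $\ln Q$ and $\ln(X/G)$ should be verified. A quick, hedged check using the automated Engle-Granger test introduced earlier (subject to the same low-power caveats):

#Engle-Granger cointegration test for log(Q) and log(X/G); H0: not cointegrated.
t_stat, p_value, crit = smt.stattools.coint(np.log(DT1["Q"]), np.log(DT1["X"] / DT1["G"]), trend = 'c')
print("t-statistic:", np.round(t_stat, 4),
      "| p-value:", np.round(p_value, 4),
      "| 5% critical value:", np.round(crit[1], 4))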

Let's specify the cointegration (equilibrium) equation:

In [55]:
ci = smf.ols(formula = 'np.log(Q) ~ 1 + np.log(X/G)', data = DT1).fit()
print(np.round(pd.DataFrame([ci.params, ci.tvalues, ci.pvalues], index = ["coef", "t-value", "p-value"]).T, 4))
ci_res = ci.resid
                 coef  t-value  p-value
Intercept      7.1138  28.0001      0.0
np.log(X / G)  0.3554  19.9866      0.0

We use the residuals of the cointegration equation ci and create a regression model: $$\Delta \text{log(Q)}_t = \lambda \cdot \text{ci_res}_{t-1} + \gamma_1 \Delta \text{log(Q)}_{t-1} + \omega_0 \Delta \text{log(X/G)}_{t} + \omega_1 \Delta \text{log(X/G)}_{t-1} + \epsilon_t$$

In [56]:
#Add the residuals to the data.frame:
DT1_res = pd.concat([DT1.reset_index(drop = True), ci_res.reset_index(drop = True)], axis = 1)
DT1_res.columns = ['X', 'Q', 'P', 'G', 'ci_res']

#Create the cointegration model:
model_1 = smf.ols(formula = 'd(np.log(Q)) ~ -1 + L(ci_res) + L(d(np.log(Q)), 1) + d(np.log(X/G)) + L(d(np.log(X/G)), 1)', data = DT1_res).fit()
print(np.round(pd.DataFrame([model_1.params, model_1.tvalues, model_1.pvalues], index = ["coef", "t-value", "p-value"]).T, 4))
                          coef  t-value  p-value
L(ci_res)              -0.4233  -3.5044   0.0018
L(d(np.log(Q)), 1)      0.4051   2.5445   0.0178
d(np.log(X / G))        0.5368   3.9843   0.0005
L(d(np.log(X / G)), 1) -0.2532  -1.8247   0.0805

The error correction term has the right sign and is significant (the coefficient of L(ci_res) has a p-value = 0.0018 < 0.05); the model itself is quite acceptable.

We can still improve the model by including other differences of $I(1)$ variables, e.g. log-differences of relative food prices $P/G$: $$\Delta \text{log(Q)}_t = \lambda \cdot \text{ci_res}_{t-1} + \gamma_1 \Delta \text{log(Q)}_{t-1} + \omega_0 \Delta \text{log(X/G)}_{t} + \omega_1 \Delta \text{log(X/G)}_{t-1} + \widetilde{\omega}_0 \Delta \text{log(P/G)}_{t} + \widetilde{\omega}_1 \Delta \text{log(P/G)}_{t-1} + \epsilon_t$$

In [57]:
model_2 = smf.ols(formula = 'd(np.log(Q)) ~ -1 + L(ci_res) + L(d(np.log(Q)), 1) + d(np.log(X/G)) + L(d(np.log(X/G)), 1) + d(np.log(P/G)) + L(d(np.log(P/G)), 1)', data = DT1_res).fit()
print(np.round(pd.DataFrame([model_2.params, model_2.tvalues, model_2.pvalues], index = ["coef", "t-value", "p-value"]).T, 4))
                          coef  t-value  p-value
L(ci_res)              -0.3218  -2.7869   0.0108
L(d(np.log(Q)), 1)      0.3998   2.3880   0.0260
d(np.log(X / G))        0.4961   3.5568   0.0018
L(d(np.log(X / G)), 1) -0.1935  -1.1830   0.2494
d(np.log(P / G))       -0.2982  -2.5452   0.0184
L(d(np.log(P / G)), 1)  0.2309   1.7157   0.1003

After deleting the insignificant terms (here we use an arbitrary 10% significance level instead of 5%), we get the following ECM:

In [58]:
model_2 = smf.ols(formula = 'd(np.log(Q)) ~ -1 + L(ci_res) + L(d(np.log(Q)), 1) + d(np.log(X/G)) + L(d(np.log(X/G)), 1) + d(np.log(P/G))', data = DT1_res).fit()
print(np.round(pd.DataFrame([model_2.params, model_2.tvalues, model_2.pvalues], index = ["coef", "t-value", "p-value"]).T, 4), "\n-------\n")
model_2 = smf.ols(formula = 'd(np.log(Q)) ~ -1 + L(ci_res) + L(d(np.log(Q)), 1) + d(np.log(X/G)) + d(np.log(P/G))', data = DT1_res).fit()
print(np.round(pd.DataFrame([model_2.params, model_2.tvalues, model_2.pvalues], index = ["coef", "t-value", "p-value"]).T, 4), "\n-------\n")
                          coef  t-value  p-value
L(ci_res)              -0.4005  -3.6290   0.0014
L(d(np.log(Q)), 1)      0.2690   1.7329   0.0965
d(np.log(X / G))        0.4042   3.0139   0.0062
L(d(np.log(X / G)), 1) -0.0585  -0.3915   0.6990
d(np.log(P / G))       -0.2981  -2.4433   0.0226 
-------

                      coef  t-value  p-value
L(ci_res)          -0.4044  -3.7459   0.0010
L(d(np.log(Q)), 1)  0.2401   1.7900   0.0861
d(np.log(X / G))    0.3658   4.0711   0.0004
d(np.log(P / G))   -0.3236  -3.1930   0.0039 
-------

Here the short-run demand elasticities with respect to total real expenditure and relative price are 0.3658 and -0.3236 respectively.

In the long run, however, demand is independent of the relative price, while the expenditure elasticity falls slightly to 0.3554 in the equilibrium relationship (see the model ci above).
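Finally, recall from the ECM discussion above that a unit deviation from equilibrium is compensated in roughly $1/|\lambda|$ time units; we can compute this from the estimated error correction coefficient (a small sketch using the final model_2; the data are yearly):

#Estimated error correction coefficient lambda and the implied adjustment speed:
lam = model_2.params["L(ci_res)"]
print("lambda =", np.round(lam, 4))
print("A unit deviation from equilibrium is compensated in about", np.round(1 / abs(lam), 1), "years")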

We also note that the residuals are WN:

In [59]:
fig = plt.figure(figsize = (10, 5))
model_2.resid.plot(ax = fig.add_subplot(211))
sm.graphics.tsa.plot_acf(model_2.resid, lags = 20, zero = False, ax = fig.add_subplot(223))
sm.graphics.tsa.plot_pacf(model_2.resid, lags = 20, zero = False, ax = fig.add_subplot(224))
plt.tight_layout()
plt.show()