We will examine monthly index data of the New York Stock Exchange from 1952:1 to 1996:1:
suppressPackageStartupMessages({library(forecast)})
nyse <- read.table(url("http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/nyse.txt"), header = TRUE)
nyse <- ts(nyse, start = 1952, frequency = 12)
par(mfrow = c(1, 3))
plot(nyse); plot(log(nyse)); plot(diff(log(nyse)))
we shall model the variable log(nyse)
in two different but equivalent ways:
lnyse <- log(nyse)
mod.a <- arima(lnyse, c(1,0,0), xreg = time(lnyse))
mod.a
##
## Call:
## arima(x = lnyse, order = c(1, 0, 0), xreg = time(lnyse))
##
## Coefficients:
## ar1 intercept time(lnyse)
## 0.9832 -144.2528 0.0763
## s.e. 0.0076 12.4969 0.0063
##
## sigma^2 estimated as 0.001632: log likelihood = 945.18, aic = -1882.37
suppressPackageStartupMessages({library(dynlm)})
mod.l <- dynlm(lnyse ~ time(lnyse) + L(lnyse))
summary(mod.l)
##
## Time series regression with "ts" data:
## Start = 1952(2), End = 1996(1)
##
## Call:
## dynlm(formula = lnyse ~ time(lnyse) + L(lnyse))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.221927 -0.025066 0.002876 0.026943 0.147915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.169524 1.124569 -1.929 0.0542 .
## time(lnyse) 0.001152 0.000595 1.936 0.0534 .
## L(lnyse) 0.984518 0.008150 120.792 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04055 on 525 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9981
## F-statistic: 1.38e+05 on 2 and 525 DF, p-value: < 2.2e-16
The coefficients of the two models differ because of different parametrization, however, both models practically coincide (their residual differences are in fact zeros):
summary(mod.a$residuals - mod.l$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.119e-04 -3.638e-04 -3.215e-05 1.198e-04 6.636e-04 1.021e-03
The most important coefficient is that of the lagged term - in the first case it equals 0.9832
and in the second 0.984518
(do not forget that these are only the estimates of \(\phi\) from \(Y_t = \phi Y_{t-1} + w_t\)).
The estimate is very close to 1, however, we cannot test the null hypothesis \(H_0: \phi = 1\) by using the presented p-value of <2e-16
(it is used to test the hypothesis \(H_0: \phi = 0\)).
We cannot test the hypothesis \(H_0: \phi = 1\) in a standard way due to the fact that \((\hat{\phi} - 1)/s.e\), provided \(H_0\) is true, has not the Student but another, namely, the Dickey-Fuller distribution.
Note: to forecast the first model using forecast()
, we need to create a variable for the future value of time:
t.1 <- c(tail(time(lnyse), 1)) + 1 / frequency(lnyse)
time.f <- seq(from = t.1, by = 1 / frequency(lnyse), length.out = 20)
plot(forecast(mod.a, h = 20, xreg = time.f), include = 120)
We shall apply the OLS method and proceed as follows:
Now, the original hypothesis \(H_0:\) the process has a unit root will transform to \(H_0:\rho = 0\).
Let us find the ‘right’ order of the model. To do this, use the model’s output table to test the hypothesis \(\gamma_{p_{max}-1} = 0\). If the p-value > 0.05, reduce the order and repeat the calculations. When the ‘right’ order is found, we shall test the significance of \(\delta\).
The significance of \(\rho\) is tested in a different way:
If the final version contains the linear term \(\delta t\), the 5% Dickey-Fuller critical value is approx. -3.45. thus, if the t-ratio of the term \(Y_{t-1}\) is less than -3.45, the null \(H_0\) the process has a unit root is rejected and we decide that the AR process is stationary.
If the final version of the model does not contain \(\delta t\), the 5% Dickey-Fuller critical value is approx -2.89. Thus, if the t-ratio of the term \(Y_{t-1}\) is less than -2.89, the null \(H_0\) the process has a unit root is rejected and we decide that the AR process is stationary.
We shall apply the procedure to lnyse
. Let use begin with \(p_{max} = 4\):
mod.4 <- dynlm(d(lnyse) ~ L(lnyse) + L(d(lnyse), 1:4) + time(lnyse))
summary(mod.4)
##
## Time series regression with "ts" data:
## Start = 1952(6), End = 1995(13)
##
## Call:
## dynlm(formula = d(lnyse) ~ L(lnyse) + L(d(lnyse), 1:4) + time(lnyse))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.220902 -0.024758 0.001637 0.026627 0.152618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.3687707 1.1485800 -2.062 0.0397 *
## L(lnyse) -0.0172567 0.0083590 -2.064 0.0395 *
## L(d(lnyse), 1:4)1 0.0544182 0.0439376 1.239 0.2161
## L(d(lnyse), 1:4)2 -0.0282788 0.0439795 -0.643 0.5205
## L(d(lnyse), 1:4)3 0.0150921 0.0439341 0.344 0.7313
## L(d(lnyse), 1:4)4 0.0364522 0.0439285 0.830 0.4070
## time(lnyse) 0.0012584 0.0006077 2.071 0.0389 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04061 on 517 degrees of freedom
## Multiple R-squared: 0.01244, Adjusted R-squared: 0.0009837
## F-statistic: 1.086 on 6 and 517 DF, p-value: 0.3696
We see that the coefficient of L(d(lnyse), 1:4)4
is insignificant (because its p-value = 0.4070 > 0.05
). So, now, instead of L(d(lnyse), 1:4)
, we reduce the maximum order to \(p_{max} = 3\):
mod.3 <- dynlm(d(lnyse) ~ L(lnyse) + L(d(lnyse), 1:3) + time(lnyse))
summary(mod.3)
##
## Time series regression with "ts" data:
## Start = 1952(5), End = 1996(1)
##
## Call:
## dynlm(formula = d(lnyse) ~ L(lnyse) + L(d(lnyse), 1:3) + time(lnyse))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.220029 -0.025348 0.002672 0.027110 0.152142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.2563268 1.1441167 -1.972 0.0491 *
## L(lnyse) -0.0161543 0.0083185 -1.942 0.0527 .
## L(d(lnyse), 1:3)1 0.0513588 0.0439385 1.169 0.2430
## L(d(lnyse), 1:3)2 -0.0273597 0.0439384 -0.623 0.5338
## L(d(lnyse), 1:3)3 0.0151113 0.0439494 0.344 0.7311
## time(lnyse) 0.0011980 0.0006054 1.979 0.0484 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04065 on 519 degrees of freedom
## Multiple R-squared: 0.01048, Adjusted R-squared: 0.000945
## F-statistic: 1.099 on 5 and 519 DF, p-value: 0.3599
Coefficient of L(d(lnyse), 1:3)3
is insignificant (p-value = 0.7311 > 0.05
). So we continue the procedure (we omit the results - check them! - the coefficients of L(d(lnyse), 1:2)2
and L(d(lnyse), 1)
are insignificant):
mod.2 <- dynlm(d(lnyse) ~ L(lnyse) + L(d(lnyse), 1:2) + time(lnyse))
# summary(mod.2)
mod.1 <- dynlm(d(lnyse) ~ L(lnyse) + L(d(lnyse), 1) + time(lnyse))
# summary(mod.1)
until we get such a model:
mod.0 <- dynlm(d(lnyse) ~ L(lnyse) + time(lnyse))
summary(mod.0)
##
## Time series regression with "ts" data:
## Start = 1952(2), End = 1996(1)
##
## Call:
## dynlm(formula = d(lnyse) ~ L(lnyse) + time(lnyse))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.221927 -0.025066 0.002876 0.026943 0.147915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.169524 1.124569 -1.929 0.0542 .
## L(lnyse) -0.015482 0.008150 -1.900 0.0580 .
## time(lnyse) 0.001152 0.000595 1.936 0.0534 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04055 on 525 degrees of freedom
## Multiple R-squared: 0.0071, Adjusted R-squared: 0.003317
## F-statistic: 1.877 on 2 and 525 DF, p-value: 0.1541
Where we see that our linear term time(lnyse)
is not significant. So, our final model:
mod.0 <- dynlm(d(lnyse) ~ L(lnyse))
summary(mod.0)
##
## Time series regression with "ts" data:
## Start = 1952(2), End = 1996(1)
##
## Call:
## dynlm(formula = d(lnyse) ~ L(lnyse))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.225147 -0.024324 0.002601 0.025993 0.157881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0077315 0.0121721 0.635 0.526
## L(lnyse) -0.0001374 0.0019082 -0.072 0.943
##
## Residual standard error: 0.04065 on 526 degrees of freedom
## Multiple R-squared: 9.851e-06, Adjusted R-squared: -0.001891
## F-statistic: 0.005181 on 1 and 526 DF, p-value: 0.9426
Our final model does not contain a linear term, so the Dickey-Fuller critical value i -2.89. The t-value from our model is -0.072 > -2.89
, so we have no ground to reject \(\rho = 0\).
Thus, lnyse
has a unit root and it can be modeled as:
mod.f <- dynlm(d(lnyse) ~ 1)
summary(mod.f)
##
## Time series regression with "ts" data:
## Start = 1952(2), End = 1996(1)
##
## Call:
## dynlm(formula = d(lnyse) ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.225318 -0.024289 0.002496 0.026039 0.157962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006865 0.001768 3.884 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04062 on 527 degrees of freedom
In other words, \(lnyse_t = 0.006865 + lnyse_{t-1} + w_t\), i.e. this is a random walk with drift.
The 5% Dickey-Fuller approximate critical values were presented above. In our case, -0.072 was definitely greater than -2.89, but, generally, once the order and necessity of \(\delta t\) are established, use the Augmented Dickey-Fuller test, implemented in ur.df()
function. Recall that this test is applied in the cases:
"none"
option);"drift"
option);"trend"
option).In all three cases, we test the null hypothesis \(H_0: \phi = 1\) (this is equivalent to \(H_0: \rho = 0\)). It’s critical values depend on type
option and are equal to, respectively, tau1
, tau2
, tau3
. Note that the critical values remain the same even if the process is not AR(1), but AR(p).
suppressPackageStartupMessages({library(urca)})
lnyse.df <- ur.df(lnyse, lags = 0, type = "drift")
summary(lnyse.df)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression drift
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.225147 -0.024324 0.002601 0.025993 0.157881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0077315 0.0121721 0.635 0.526
## z.lag.1 -0.0001374 0.0019082 -0.072 0.943
##
## Residual standard error: 0.04065 on 526 degrees of freedom
## Multiple R-squared: 9.851e-06, Adjusted R-squared: -0.001891
## F-statistic: 0.005181 on 1 and 526 DF, p-value: 0.9426
##
##
## Value of test-statistic is: -0.072 7.5294
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1 6.43 4.59 3.78
Since the Value of test-statistic is: -0.072
and is closer to zero than the 5% critical value : tau2 = -2.86 < -0.072 < 0
, there is no ground to reject the null \(H_0:\) the process has a unit root.
Testing for unit root with the sequential procedure A and procedure B last long. The same ur.df
function allows us to automate the procedure: once you choose the maximum number of lags (e.g. 4) and note to include a linear trend, the ur.df
function creates all possible models with lags = 4, 3, 2, 1, 0
and chooses the one with the least BIC. Note that this model will differ from the previous, however, it is hard to say which one is more accurate.
lnyse.df <- ur.df(lnyse, type = "trend", lags = 4, selectlags = "BIC")
summary(lnyse.df)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression trend
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.219863 -0.024691 0.003008 0.026555 0.153257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.559e-02 3.915e-02 2.186 0.0292 *
## z.lag.1 -1.680e-02 8.209e-03 -2.046 0.0412 *
## tt 1.023e-04 4.983e-05 2.053 0.0406 *
## z.diff.lag 5.261e-02 4.380e-02 1.201 0.2302
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04054 on 520 degrees of freedom
## Multiple R-squared: 0.01001, Adjusted R-squared: 0.004295
## F-statistic: 1.752 on 3 and 520 DF, p-value: 0.1554
##
##
## Value of test-statistic is: -2.0463 6.0166 2.1302
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -3.96 -3.41 -3.12
## phi2 6.09 4.68 4.03
## phi3 8.27 6.25 5.34
Value of test-statistic is: -2.0463
and Critical values for test statistics
is tau3: -3.96
. thus, the bottom line is: since the test statistic is greater than the critical value, lnyse
is a process with (a linear trend and) a unit root.
If instead of type = "trend"
, we choose type = "drift"
, once again, we get that the process has a unit root.
Thus, we have three possible models to choose from for forecasting. Without any deeper argumentation, we select the mod.a
. lnyse
is forecasted with the formula \(lnyse_t = 0.006865 + lnyse_{t-1} + w_t\): to do this in R
, use the Arima
function:
suppressPackageStartupMessages({library(forecast)})
lnyse.ur <- Arima(lnyse, c(0, 1, 0), include.drift = TRUE)
lnyse.ur
## Series: lnyse
## ARIMA(0,1,0) with drift
##
## Coefficients:
## drift
## 0.0069
## s.e. 0.0018
##
## sigma^2 estimated as 0.00165: log likelihood=942.79
## AIC=-1881.59 AICc=-1881.56 BIC=-1873.05
plot(forecast(lnyse.ur), include = 72)
We will examine quarterly long-run interest rates from 1954 Q1 to 1994 Q4:
suppressPackageStartupMessages({require(readxl)})
txt1 <- "http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/"
txt2 <- "interestrates.xls"
tmp = tempfile(fileext = ".xls")
#Download the file
download.file(url = paste0(txt1, txt2), destfile = tmp, mode = "wb")
#Read it as an excel file
rate <- data.frame(read_excel(path = tmp))
rate <- ts(rate[, -1], start = 1954, frequency = 4)
Let us view the data:
plot(rate)
An initial look at the plots suggests that the data is not stationary since it is increasing throughout the period. However we cannot say if this is because of a trend (which would make this a TS series), or if this is because the series has a unit root and a drift (this would make it a DS series). [For an example comparing plots of DS with TS series see the lecture slides]
plot(diff(rate))
On the other hand, the differences seem to be varying over a constant value. What it more, there variance is not drastically changing with time, which suggests that our series might have a unit root.
We will analyse only the Longrate
variable. The analysis is identically carried out for the Shortrate
variable.
If our data has a unit root, this would mean that the coefficient \(\phi = 1\) in our equation \(Longrate_t = \alpha + \phi Longrate_{t-1} + \epsilon_t\). A good first-look can be seen by checking the Time Series Lag Plots with lag.plot
, which plots the time series against lagged versions of themselves. They help visualize auto-dependence:
Longrate <- rate[, 1]
# Shortrate<- rate[, 2]
lag.plot(Longrate)
We can see that there is a clear linear dependence between Longrate
and its first lag. However, plotting the difference lag.plot
shows us that a relationship between \(\Delta Longrate_t\) and its lag is not present when taking the first differences of the data.
lag.plot(diff(Longrate))
We can also calculate the correlation between \(Longarte_t\) and \(Longrate_{t-1}\):
cor(Longrate[-1], Longrate[-length(Longrate)])
## [1] 0.9997162
cor(diff(Longrate[-1]), diff(Longrate[-length(Longrate)]))
## [1] -0.002354711
The correlation between our initial data and its lag is almost 1, while the correlation of the differenced data is -0.00235
, which suggests that a unit root may be present in our data.
Finally, we can check the (sample) ACF plot to see if our series is stationary:
par(mfrow = c(2, 2))
Acf(Longrate)
Pacf(Longrate)
Acf(diff(Longrate))
Pacf(diff(Longrate))
While the series is not stationary, it is not clear if the slow decline is because of a trend or because of a unit root. However, the fact that first differences appear to be WN, while the original data isn’t, suggests that our data might be a DS series, i.e. it might have a unit root.
Furthermore, the first bar of the PACF plot is almost 1, which, along with slowly declining ACF, gives an indication of a unit root in our time series.
The result of this analysis allows us to guess that Longrate
, most likely, has a unit root. We repeat the procedure of the previous example with \(p_{max} - 1 = 4\) and finally end with the model
\(\Delta Y_t = \alpha + \rho Y_{t-1} + w_t\):
mod.0 <- dynlm(d(Longrate) ~ L(Longrate))
summary(mod.0)
##
## Time series regression with "ts" data:
## Start = 1954(2), End = 1994(4)
##
## Call:
## dynlm(formula = d(Longrate) ~ L(Longrate))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.027981 -0.004469 0.000053 0.004992 0.038346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.038606 0.014392 2.682 0.00807 **
## L(Longrate) -0.003983 0.001871 -2.130 0.03473 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009194 on 161 degrees of freedom
## Multiple R-squared: 0.0274, Adjusted R-squared: 0.02135
## F-statistic: 4.535 on 1 and 161 DF, p-value: 0.03473
It would be wrong to say that since the t-value |-2.13| > 2, we reject the null \(\rho = 0\). The right answer can be obtained with the Dickey-Fuller test:
Longrate.df <- ur.df(Longrate, 0, type = "drift")
summary(Longrate.df)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression drift
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.027981 -0.004469 0.000053 0.004992 0.038346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.038606 0.014392 2.682 0.00807 **
## z.lag.1 -0.003983 0.001871 -2.130 0.03473 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009194 on 161 degrees of freedom
## Multiple R-squared: 0.0274, Adjusted R-squared: 0.02135
## F-statistic: 4.535 on 1 and 161 DF, p-value: 0.03473
##
##
## Value of test-statistic is: -2.1295 63.8967
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau2 -3.46 -2.88 -2.57
## phi1 6.52 4.63 3.81
Because the t-value=-2.13 > -2.88
, we have no ground to reject the null, therefore, Longrate
is a random walk with drift:
Long.mdl <- Arima(Longrate, c(0, 1, 0), include.drift = TRUE)
coefficients(Long.mdl)
## drift
## 0.007994899
\[Y_t - Y_{t-1} = 0.00799 + w_t\]
We can also forecast this process using auto.arima
:
Long.auto <- auto.arima(Longrate, d = 1)
plot(forecast(Long.auto, 12), include = 20)