4.1 Chapter Exercises

You are encouraged to find any other data that may be interesting to you. Some of the data used in this section is simulated, since simulated data makes it much easier to explore specific model properties, as well as the capabilities of R and Python libraries. After more advanced time series topics are covered, it will become easier to analyse real-world datasets.

4.1.1 Time Series Datasets

Below we list a selection of time series processes and datasets, which you can examine.

There are various time series datasets available; we will mainly use the ones provided below.

In addition, some packages are needed in order to load and prepare the data:

# Import the required modules for vector and matrix operations and data generation:
import numpy as np
# Import the required modules for plot creation:
import matplotlib.pyplot as plt
# Import the required modules for time series data generation:
import statsmodels.api as sm
# Import the required modules for test statistic calculation:
import statsmodels.stats.api as sm_stat
# Import the required modules for model estimation:
import statsmodels.tsa.api as smt
# Import pandas for dataset handling:
import pandas as pd

4.1.1.1 Dataset 1 (Simulation)

Generate the following process with linear trend and correlated errors: \[Y_t = -7 + 0.3 \cdot t + 5 e_t, \text{ where } e_t = 0.88 e_{t-1} - 0.53 e_{t-2} + w_t,\ t = 1,...,150\]
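
A minimal simulation sketch in Python, assuming the shocks \(w_t\) are i.i.d. standard normal and the errors start at zero (the exercise does not specify these):

np.random.seed(123)
T = 150
w = np.random.normal(size = T)
e = np.zeros(T)
# AR(2) errors: e[t] = 0.88 * e[t-1] - 0.53 * e[t-2] + w[t]
for i in range(2, T):
    e[i] = 0.88 * e[i - 1] - 0.53 * e[i - 2] + w[i]
t = np.arange(1, T + 1)
Y1 = -7 + 0.3 * t + 5 * e
plt.plot(t, Y1)
plt.title("Dataset 1: linear trend with AR(2) errors")
plt.show()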

4.1.1.2 Dataset 2 (Simulation)

Generate a Random Walk with drift \(0.1\) (\(T = 150\)).
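
A possible way to generate this in Python, assuming i.i.d. standard normal innovations and a zero starting value:

np.random.seed(123)
T = 150
# Random walk with drift: Y[t] = 0.1 + Y[t-1] + w[t], with Y[0] = 0
Y2 = np.cumsum(0.1 + np.random.normal(size = T))
plt.plot(Y2)
plt.title("Dataset 2: random walk with drift 0.1")
plt.show()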

Considerations

Until 1982, when Nelson and Plosser published their analysis of historical macroeconomic time series, most economists believed that all economic time series were \(TS\) (trend-stationary, i.e. they become stationary after removing a deterministic trend). Nelson and Plosser showed that most economic series are better described as \(DS\) (difference-stationary, i.e. their differences are stationary). Verify whether this is true on some of the real-world sample data provided below.
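
To make the distinction concrete, here is a small illustrative sketch (simulated data, not part of the exercises):

np.random.seed(123)
T = 150
t = np.arange(T)
ts_series = 0.1 * t + np.random.normal(size = T)         # TS: stationary around a trend
ds_series = np.cumsum(0.1 + np.random.normal(size = T))  # DS: random walk with drift
# Removing a fitted linear trend makes the TS series stationary,
# while the DS series only becomes stationary after differencing:
detrended = ts_series - np.polyval(np.polyfit(t, ts_series, deg = 1), t)
differenced = np.diff(ds_series)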

4.1.1.3 Dataset 3

Fourteen U.S. economic time series spanning 1860 to 1970. See the documentation for the variable descriptions:

data(nporg, package = "urca")

Take gnp.r and cpi and examine these series separately.
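
If you prefer Python, one option is the get_rdataset function from statsmodels, which downloads data from the Rdatasets repository. Note the assumption here - this only works if nporg is mirrored in that repository; otherwise, export the data from R to a .csv file and read it with pandas:

# This assumes "nporg" is available in the Rdatasets repository;
# if not, export it from R and use pd.read_csv() instead.
nporg = sm.datasets.get_rdataset("nporg", "urca").data
gnp_r = nporg["gnp.r"].dropna()
cpi = nporg["cpi"].dropna()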

4.1.1.4 Dataset 4

A data frame of U.K. quarterly data ranging from 1955:Q1 to 1984:Q4. The data are expressed in natural logarithms:

  • consl - The log of total real consumption in the U.K.
  • incl - The log of real disposable income in the U.K.

data(UKconinc, package = "urca")

Examine these series separately.

4.1.1.5 Dataset 5

A data frame of U.K. quarterly data ranging from 1957:Q1 to 1975:Q4:

  • cons - Consumers non-durable expenditure in the U.K. in 1970 prices.
  • inc - Personal disposable income in the U.K. in 1970 prices.
  • price - Consumers expenditure deflator index, 1970 = 100.

data(UKconsumption, package = "urca")

Examine these series separately.

4.1.1.6 Dataset 6

A dataset of the number of users logged on to an internet server each minute over a 100-minute period.

internet <- stats::ts(c(88, 84, 85, 85, 84, 85, 83, 85, 88, 89, 91, 99,
                        104, 112, 126, 138, 146, 151, 150, 148, 147, 149, 143, 132, 131,
                        139, 147, 150, 148, 145, 140, 134, 131, 131, 129, 126, 126, 132,
                        137, 140, 142, 150, 159, 167, 170, 171, 172, 172, 174, 175, 172,
                        172, 174, 174, 169, 165, 156, 142, 131, 121, 112, 104, 102, 99,
                        99, 95, 88, 84, 84, 87, 89, 88, 85, 86, 89, 91, 91, 94, 101, 110,
                        121, 135, 145, 149, 156, 165, 171, 175, 177, 182, 193, 204, 208,
                        210, 215, 222, 228, 226, 222, 220), start = 1, frequency = 1)

4.1.1.7 Dataset 7

Price of chicken in the U.S. (constant dollars): 1924–1993.

chicken <- stats::ts(c(164.16, 169.17, 180.65, 168.30, 180.73, 192.55,
                       159.43, 150.11, 126.05, 106.08, 119.92, 157.06, 156.59, 161.21,
                       151.94, 137.47, 134.10, 153.25, 166.02, 203.24, 194.83, 208.18,
                       204.40, 171.61, 180.87, 154.12, 133.40, 139.22, 120.43, 119.53,
                       90.41, 100.48, 85.16, 70.41, 70.04, 54.59, 59.59, 48.84, 48.78,
                       47.25, 42.90, 40.80, 43.23, 34.23, 34.09, 38.27, 33.90, 27.48,
                       31.12, 49.16, 28.44, 26.60, 33.02, 29.34, 27.49, 27.67, 19.29,
                       17.65, 15.43, 18.43, 22.12, 19.88, 16.48, 14.00, 11.25, 17.38,
                       16.45, 15.69, 15.25, 14.64), start = 1924, frequency = 1)

4.1.1.8 Dataset 8

Daily air quality measurements in New York, May to September 1973.

airquality <- datasets::airquality

Examine the Temp (temperature) variable.

4.1.1.9 Dataset 9

New York Stock Exchange (NYSE) data:

nyse <- read.table(url("http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/nyse.txt"), header = TRUE)
nyse <- ts(nyse, start = 1952, frequency = 12)
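
For Python users, a pandas-based alternative could look as follows (assuming the URL above is reachable):

# Hypothetical Python equivalent of the R code above:
nyse = pd.read_csv("http://uosis.mif.vu.lt/~rlapinskas/(data%20R&GRETL/nyse.txt",
                   sep = r"\s+")
nyse.index = pd.period_range(start = "1952-01", periods = len(nyse), freq = "M")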

Note: It may very well be the case that some (or even all) of the data do not have unit roots. The idea is to carry out the unit root testing and model building procedures, as you would when working with any other empirical data.

4.1.2 Tasks

The following tasks are universal for all datasets. This is in order to highlight that in time series analysis, regardless of what the true underlying process is, we still follow the same steps to carry out our analysis.

4.1.2.1 Exercise Set 1: Exploratory Data Analysis (EDA)

  1. Plot the series - do they appear stationary? Do they appear to exhibit exponential changes? If needed, transform the series and continue working with the transformed data.

  2. Plot their \(\rm ACF\) and \(\rm PACF\) - do the series appear to be autocorrelated? (A Python sketch covering both tasks is given below.)
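
In Python, both tasks could be carried out along these lines, where Y stands for whichever series you are examining:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, ax = plt.subplots(3, 1, figsize = (10, 8))
ax[0].plot(Y)
ax[0].set_title("Series")
plot_acf(Y, lags = 20, ax = ax[1])
plot_pacf(Y, lags = 20, ax = ax[2])
plt.tight_layout()
plt.show()
# If the series is positive and exhibits exponential growth, take logarithms:
# Y = np.log(Y)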

4.1.2.2 Exercise Set 2: Unit Root Testing

  1. Carry out a unit root test two ways:

    • Manually (i.e. sequentially) by using dynlm (in R) to estimate the relevant models for unit root testing. Don’t forget to write down the null hypothesis for the unit root test.
    • Use the built-in functions to carry out the ADF, KPSS and PP tests. Write down the null hypothesis of each test (see the sketch after this list).
  2. Depending on the results, transform the series to induce stationarity. Test whether the transformed series is indeed stationary and examine its \(\rm ACF\) and \(\rm PACF\) plots.
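
A Python sketch of the second approach, assuming Y is a numpy array - the ADF and KPSS tests are available in statsmodels, while a Phillips-Perron test can be found in the separate arch package:

from statsmodels.tsa.stattools import adfuller, kpss

# ADF test: H0 - the series has a unit root
adf_stat, adf_pvalue = adfuller(Y, regression = "ct")[:2]
# KPSS test: H0 - the series is (trend-)stationary; note the reversed hypotheses
kpss_stat, kpss_pvalue = kpss(Y, regression = "ct", nlags = "auto")[:2]
print("ADF p-value: ", adf_pvalue)
print("KPSS p-value:", kpss_pvalue)
# If a unit root is found, difference the series and re-run the tests:
dY = np.diff(Y)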

4.1.2.3 Exercise Set 3: Model Specification

  1. Select the appropriate model for the series either manually, or using an automated procedure (remember the drawbacks of automated procedures - if needed, restrict the maximum order of differencing and the seasonal/non-seasonal lag orders).

  2. Write down the model equation for \(\Delta Y_t\) and the equation for \(Y_t\). (Note: in R you are free to use either dynlm or Arima to specify your model equation, as long as it is the one you used in the previous tasks. You can also consult the auto.arima documentation on its author's website for a more general model formula written in terms of the backshift (lag) operator.) A Python estimation sketch follows below.
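
A minimal estimation sketch in Python - the order below is a placeholder chosen for illustration; select yours from the ACF/PACF plots, or use an automated search such as auto_arima from the pmdarima package (the Python analogue of R's auto.arima), where the maximum differencing and lag orders can be restricted:

from statsmodels.tsa.arima.model import ARIMA

# Placeholder specification - replace (1, 1, 1) with the order you selected:
res = ARIMA(Y, order = (1, 1, 1)).fit()
print(res.summary())
# Information criteria are useful when comparing candidate specifications:
print("AIC:", res.aic, "BIC:", res.bic)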

4.1.2.4 Exercise Set 4: Forecasting The Future

  1. Calculate the \(10\)-step ahead forecasts for the original series.
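
With the fitted statsmodels ARIMA model from the previous sketch, the forecasts could be computed as follows:

# 10-step-ahead forecasts with 95% confidence intervals:
fc = res.get_forecast(steps = 10)
print(fc.predicted_mean)
print(fc.conf_int(alpha = 0.05))
# Note: if the series was log-transformed, apply np.exp() to return
# the forecasts to the original scale.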

4.1.2.5 Exercise Set 5: Model Cross Validation

  1. Carry out cross-validation for one-step-ahead forecasts of your specified model. You can do this by creating \(k\) sub-samples of your original series. Carry out cross-validation as follows (a Python sketch is given after this list):

    • Create \(k\) different samples: \((Y_1, ..., Y_{T-k-1})\), \((Y_1, ..., Y_{T-k})\), …, \((Y_1, ..., Y_{T-1})\). The samples have an expanding window - each sample is a subset of the next one.
    • Assume that the model, estimated in Task 5, is appropriate for your series. Re-estimate that same model on each subsample and calculate its one-step-ahead forecast;
    • Calculate the error between the true value and its one-step-ahead forecast, \(e_i = Y_i - \widehat{Y}_i\), where \(Y_i\) is the observation immediately following subsample \(i\). Do this for each subsample;
    • Save the model coefficient estimates for each subsample.
    • Calculate the \(RMSE = \sqrt{\dfrac{1}{k} \sum_{i = 1}^k e_i^2}\) and compare it with the \(\rm RMSE\) from the model in Task 5 - are they close? If they are, what does this say about your model?
    • Remember that for time series cross-validation the first subset is the smallest, while the last subset is the largest. This means that as we move to larger subsets, our coefficient estimates should (hopefully) converge to their true values. Plot the model coefficient estimates against the subsample index - do they exhibit large changes as the sample size increases? If so, what does this say about the stability of our model (i.e. could we say that our estimated model will be just as accurate in the future, when we have more data)?
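
A sketch of this expanding-window procedure in Python, assuming Y is a numpy array; as before, the ARIMA order is a placeholder and should be replaced with your own specification:

from statsmodels.tsa.arima.model import ARIMA

# Expanding-window cross-validation: each of the last k observations is
# forecast one step ahead from a model re-estimated on all prior data.
k = 10                         # placeholder; choose k based on your sample size
T = len(Y)
errors, coefs = [], []
for n in range(T - k, T):
    res_cv = ARIMA(Y[:n], order = (1, 1, 1)).fit()  # placeholder order
    e = Y[n] - res_cv.forecast(steps = 1)[0]        # one-step-ahead error
    errors.append(e)
    coefs.append(res_cv.params)
rmse = np.sqrt(np.mean(np.square(errors)))
print("CV RMSE:", rmse)
# Plot each coefficient estimate against the subsample index:
plt.plot(np.array(coefs))
plt.xlabel("subsample index")
plt.show()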

Note: Select \(k \in \{5, ..., 20\}\), depending on your sample size. For example, if your sample size is \(T = 100\), we could select \(k = 20\) to create \(20\) sub-samples, where the first (and smallest) sample will have \(T - k - 1 = 79\) observations.