2.1 General Concepts
In this subsection we will present some basic definitions and concepts for time series, some of which will also be used in later chapters.
2.1.1 What is a Time Series
A time series is a set of observations \(y_t\), collected at specific points in time and ordered by time. Here the index \(t\) indicates the time period at which the value \(y_t\) was observed. \(t\) can also be thought of as an index used to sort the data: \(y_1, y_2, ...\). Time series data can be observed at many frequencies. The most common are:
Name | Frequency (observations per year) |
---|---|
Annual | 1 observation per year
Quarterly | 4 observations per year
Monthly | 12 observations per year
Weekly | usually treated as a fixed number, e.g. 52 observations per year
Daily | can be treated as a fixed number, e.g. a commercial year of 12 months of 30 days gives 360 observations per year
The observations \(\{..., y_0, y_1, y_2, ...\}\) are realizations of random variables \(\{..., Y_0, Y_1, Y_2, ...\}\) with joint distribution function \(H(..., y_0, y_1, y_2,...) = \mathbb{P}(..., Y_0 \leq y_0, Y_1 \leq y_1, Y_2 \leq y_2,...)\). To keep notation simple, we will use an upper-case notation for both the random variables and their realizations. It will be clear from the context whether we are talking about random variables or their realizations.
We can write the time series compactly as a random variable sequence \(\{Y_t, t =0, \pm 1, \pm 2, ...\}\). We will usually re-index the time series to have non-negative indexes by setting \(Y_t = 0\), \(\forall t < 0\), so that our random variable sequence is \(\{Y_t, t = 0, 1, 2, ...\}\). We will say that \(Y_0\) is the starting value of the time series, though we will usually set \(Y_0 = 0\).
- In cross-sectional data, different observations were assumed to be uncorrelated;
- In time series we require that there be some dynamics, some persistence - some way in which the present is linked to the past, and the future to the present. Having historical data then allows us to forecast the future.
Because of the time-persistence property of a time series, we need to define operators which would let us operate on an element of a time series to produce the previous element.
Unless stated otherwise, we will restrict ourselves to the discrete-time setting, i.e. for \(t \in \mathbb{Z}\).
2.1.2 The Lag Operator
The lag operator \(L\) is used to lag a time series: \(LY_t = Y_{t-1}\). Similarly: \(L^2 Y_t = L(LY_t) = Y_{t-2}\), etc. In general, we can write: \[L^p Y_t = Y_{t-p}\] Typically, we operate on a time series with a polynomial in the lag operator. A lag operator polynomial of degree \(m\) is: \[B(L) = \beta_0 + \beta_1 L + \beta_2 L^2 + ... + \beta_m L^m\]
Example 2.1 If \(B(L) = 1 + 0.9L -0.6 L^2\), then:
\[B(L)Y_t = Y_t + 0.9 Y_{t-1} -0.6 Y_{t-2}\] We can also write an infinite-order lag operator polynomial as: \[B(L) = \beta_0 + \beta_1 L + \beta_2 L^2 + ... = \sum_{j = 0}^\infty \beta_j L^j, \quad L^0 = 1\]
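The lag polynomial of Example 2.1 can be applied numerically. Below is a minimal sketch in Python (NumPy assumed; the function name `lag_poly` and the example series are our own), which applies \(B(L) = 1 + 0.9L - 0.6L^2\) to a short series:

```python
import numpy as np

def lag_poly(beta, y):
    """Apply B(L) = beta[0] + beta[1] L + ... + beta[m] L^m to a series y.

    Returns B(L) y_t for each t at which all m lags exist,
    i.e. for t = m, ..., len(y) - 1.
    """
    m = len(beta) - 1
    return sum(b * y[m - j : len(y) - j] for j, b in enumerate(beta))

# Example 2.1: B(L) = 1 + 0.9 L - 0.6 L^2
y = np.array([1.0, 2.0, 3.0, 4.0])
out = lag_poly([1.0, 0.9, -0.6], y)
# out[0] = y_2 + 0.9 * y_1 - 0.6 * y_0 = 3 + 1.8 - 0.6 = 4.2
```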
2.1.3 Difference Operator
The first-difference operator \(\Delta\) is a first-order polynomial in the lag operator: \[\Delta Y_t = Y_t - Y_{t-1} = (1-L)Y_t\] In other words, we have defined the first difference operator as the lag polynomial function with \(\beta_1 = -1\), so that \(B(L) = 1 - L =: \Delta\).
Powers of \(\Delta\) are: \[ \begin{aligned} \Delta^2 Y_t &= \Delta (\Delta Y_t)= (1-L)(1-L)Y_t = \left(1-2L+L^2 \right) Y_t = Y_t - 2Y_{t-1} + Y_{t-2} \end{aligned} \] and similarly, for \(\Delta^k Y_t\), \(k > 2\).
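In code, differencing is a one-liner. The sketch below (Python with NumPy assumed; the example series is our own) checks that \(\Delta^2 Y_t\) agrees with the expansion \(Y_t - 2Y_{t-1} + Y_{t-2}\):

```python
import numpy as np

y = np.array([1.0, 4.0, 9.0, 16.0, 25.0])   # an arbitrary example series

d1 = np.diff(y)         # Delta Y_t = Y_t - Y_{t-1}
d2 = np.diff(y, n=2)    # Delta^2 Y_t = (1 - L)^2 Y_t

# the same quantity via the expanded polynomial Y_t - 2 Y_{t-1} + Y_{t-2}
d2_manual = y[2:] - 2 * y[1:-1] + y[:-2]
```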
2.1.4 (Weakly) Stationary Time Series
Let \(\{Y_t\}\) be a time series with \(\mathbb{E} (Y_t^2) < \infty\). Then:
The mean function of \(\{Y_t\}\) is: \[\mu_Y (t) = \mathbb{E}(Y_t)\]
The covariance function of \(\{Y_t\}\) is: \[\gamma_Y (r, s) = \text{Cov}(Y_r, Y_s) = \mathbb{E} \left[ \left( Y_r - \mu_Y (r) \right) \left( Y_s - \mu_Y (s) \right) \right], \quad \forall r, s \in \mathbb{Z}\]
Let \(Y_1, ..., Y_T\) be observations of a time series. The sample mean is: \[\overline{Y} = \dfrac{1}{T} \sum_{t = 1}^T Y_t\]
\(\{Y_t\}\) is (weakly) stationary if:
- \(\mu_Y (t) = \mu_Y (s) = \mu_Y, \forall t,s \in \mathbb{Z}\), i.e. the mean is independent of \(t\);
- \(\gamma_Y \left( s, r \right) = \gamma_Y \left( t+s, t+r \right)\), \(\forall r,s,t \in \mathbb{Z}\), i.e. the covariance is independent of \(t\). The last property is equivalent to \(\gamma_Y \left( t, t-h \right) = \gamma_Y \left( h, 0 \right) = \gamma_Y \left( h \right)\), \(\forall t,h \in \mathbb{Z}\)
The definition combines three separate stationarity conditions:
- If \(\mathbb{E}(Y_t) = \mu\) - the process is called mean-stationary;
- If \(\mathbb{V}{\rm ar}(Y_t) = \sigma^2 < \infty\) - the process is called variance-stationary;
- If \(\gamma_Y(t, t-h) = \gamma_Y(h)\) - the process is called covariance-stationary.
In other words, a time series \(\{Y_t\}\) is stationary if its mean, variance and covariance do not depend on \(t\).
If at least one of the three requirements is not met, then the process is called non-stationary.
A weakly stationary process with zero-mean and uncorrelated random variables is called a White Noise process, which will be presented in subsection ??.
2.1.5 Strictly Stationary Time Series
\(\{ Y_t \}\) is a strictly stationary time series if: \[\left( Y_1, Y_2, ..., Y_T \right)^\top \stackrel{d}{=} \left( Y_{1+h}, Y_{2+h}, ..., Y_{T+h} \right)^\top, \quad \forall T \geq 1, \forall h \in \mathbb{Z}\] Here \(\stackrel{d}{=}\) is used to indicate that two random vectors have the same joint distribution function.
Properties of a Strictly Stationary Time Series \(\{ Y_t \}\):
- The random variables \(Y_t\) are identically distributed;
- \(\left( Y_{t}, Y_{t+h} \right)' \stackrel{d}{=} \left( Y_{s}, Y_{s+h} \right)'\), \(\forall s, t, h\);
- \(\{ Y_t \}\) is weakly stationary if \(\mathbb{E}\left( Y_t^2 \right) < \infty\), \(\forall t\);
- Weak stationarity does not imply strict stationarity;
- An i.i.d. sequence is strictly stationary.
Another way to look at it is:
- A sequence of uncorrelated random variables with constant mean and variance is a weakly stationary process;
- A sequence of independent, identically distributed random variables is a strongly (strictly) stationary process.
2.1.6 Autocovariance Function
Let \(\{Y_t \}\) be a stationary time series. The autocovariance function of \(\{ Y_t \}\) at lag \(h\) is: \[\gamma_Y\left( h \right) = \mathbb{C}{\rm ov}\left( Y_{t}, Y_{t-h} \right)\]
The basic properties of \(\gamma(\cdot)\):
- \(\gamma(0) \geq 0\);
- \(|\gamma(h)| \leq \gamma(0), \forall h\);
- \(\gamma(h) = \mathbb{C}{\rm ov}\left( Y_{t}, Y_{t-h} \right) = \mathbb{C}{\rm ov} \left( Y_{t-h}, Y_{t} \right) = \gamma(-h)\), \(\forall h\), i.e. \(\gamma(\cdot)\) is even.
Let \(Y_1, ..., Y_T\) be observations of a time series. The sample autocovariance function is: \[\hat{\gamma}\left( h \right) = \dfrac{1}{T}\sum_{t=1+|h|}^{T}\left( Y_{t} - \overline{Y} \right) \left( Y_{t-|h|} - \overline{Y}\right), \quad -T < h < T\]
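The sample autocovariance formula above translates directly into code. A minimal sketch in Python (NumPy assumed; the function name `sample_acov` and the example series are our own):

```python
import numpy as np

def sample_acov(y, h):
    """Sample autocovariance at lag h, with divisor T as in the formula above."""
    y = np.asarray(y, dtype=float)
    T, h = len(y), abs(h)
    ybar = y.mean()
    return np.sum((y[h:] - ybar) * (y[:T - h] - ybar)) / T

y = [1.0, 2.0, 3.0, 4.0]
g0 = sample_acov(y, 0)    # 1.25
g1 = sample_acov(y, 1)    # 0.3125; equals sample_acov(y, -1), since gamma is even
```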
2.1.7 Autocorrelation Function
Let \(\{Y_t \}\) be a stationary time series. The autocorrelation function (ACF) of \(\{ Y_t \}\) at lag \(h\) is: \[\rho_Y \left( h \right) = \dfrac{\gamma_Y \left( h \right)}{\gamma_Y \left( 0 \right)} = \mathbb{C}{\rm orr}\left(Y_{t}, Y_{t-h} \right) = \dfrac{\mathbb{C}{\rm ov}(Y_t, Y_{t-h})}{\sqrt{\mathbb{V}{\rm ar}(Y_{t})\mathbb{V}{\rm ar}(Y_{t-h})}}\]
To assess the degree of dependence in the data and to select a model for the data that reflects this, one of the important tools we use is the sample autocorrelation function (sample ACF) of the data.
Let \(Y_1, ..., Y_T\) be observations of a time series. The sample autocorrelation function is: \[\hat{\rho}\left( h \right) = \dfrac{\hat{\gamma}(h)}{\hat{\gamma}(0)} = \dfrac{\dfrac{1}{T}\sum_{t=1+|h|}^{T}\left( Y_{t} - \overline{Y} \right) \left( Y_{t-|h|} - \overline{Y}\right)}{\dfrac{1}{T}\sum_{t=1}^{T}\left( Y_{t} - \overline{Y}\right)^2}, \quad -T < h < T\]
The sample ACF provides us with an estimate of the ACF of \(Y_t\). If the sequence \(\{Y_t \}\) is an independent and identically distributed (i.i.d. or iid or IID) process with finite variance, then \(\rho(h) = 0, \quad \forall h > 0\), so we would expect the sample autocorrelations to be near 0 as well. It can be shown that for i.i.d. variables the sample ACFs \(\hat{\rho}(h), \quad h > 0\) are approximately i.i.d. \(\mathcal{N}\left( 0, 1/T \right)\).
So, the 95% confidence interval (of an i.i.d. time series) is: \[0 \pm \dfrac{1.96}{\sqrt{T}}\] (i.e. 95% of the sample autocorrelations of an i.i.d. process should fall between these bounds).
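This can be checked by simulation. The sketch below (Python, NumPy assumed; the seed, sample size and number of lags are our own choices) draws an i.i.d. normal series, computes the first 20 sample autocorrelations and counts how many fall inside \(0 \pm 1.96/\sqrt{T}\); the share should be close to 95%:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 1000
y = rng.standard_normal(T)                  # i.i.d. N(0, 1) series

ybar = y.mean()
gamma0 = np.sum((y - ybar) ** 2) / T
acf = np.array([np.sum((y[h:] - ybar) * (y[:T - h] - ybar)) / T / gamma0
                for h in range(1, 21)])     # sample ACF at lags 1..20

bound = 1.96 / np.sqrt(T)
share_inside = np.mean(np.abs(acf) < bound)  # roughly 0.95 for i.i.d. data
```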
In general, the critical value of a standard normal distribution for a given confidence level can be found in two steps:
- Compute \(\alpha = (1-Q)/2\), where \(Q\) is the confidence level;
- Find \(z_{1-\alpha}\), the \((1-\alpha)\) quantile of the standard normal distribution - this is the critical value.
For example, if we want to find the z-score of a 95% confidence level, then \(Q = 0.95\), so \(\alpha = 0.025\). The standard normal distribution's \(1-\alpha\) quantile is then \(z_{0.975} \approx 1.96\).
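These two steps can be carried out with any statistical library. A sketch in Python using SciPy's `norm.ppf` (the standard normal quantile function):

```python
from scipy.stats import norm

Q = 0.95                  # confidence level
alpha = (1 - Q) / 2       # 0.025 left in each tail
z = norm.ppf(1 - alpha)   # the 0.975 quantile, approx. 1.96
```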
For linear models \(\hat{\mathbf{\rho}}_k = \left( \hat{\rho}(1), ..., \hat{\rho}(k) \right)'\) is approximately distributed (for large T) as \(\mathcal{N}\left( \mathbf{\rho}_k, T^{-1}W \right)\), i.e. \(\hat{\mathbf{\rho}} \approx \mathcal{N}\left( \mathbf{\rho}, T^{-1}W \right)\), where \(\mathbf{\rho} = \left( \rho(1), ..., \rho(k) \right)'\) and \(W\) is the covariance matrix, whose elements are given by Bartlett’s formula: \[ \begin{aligned} w_{ij} &= \sum_{k = 1}^{\infty} \left[ \rho(k+i) + \rho(k-i) - 2\rho(i)\rho(k) \right] \cdot \left[ \rho(k+j) + \rho(k-j) - 2\rho(j)\rho(k) \right] \\ w_{ii} &= \sum_{k = 1}^{\infty} \left[ \rho(k+i) + \rho(k-i) - 2\rho(i)\rho(k) \right]^2 \end{aligned} \]
So, the 95% confidence interval/bounds (of a specific linear time series model) is: \[\rho(i) \pm \dfrac{1.96 \cdot w_{ii}^{1/2}}{\sqrt{T}}\]
Where \({\rho}(i)\) is the ACF and the value \(w_{ii}\) is calculated for a specific model which is assumed to be the underlying data generating process (DGP). Because the true value of \(\rho(\cdot)\) is not known in practice, the sample autocorrelations \(\hat{\rho}(\cdot)\) are used in order to check the hypothesis that the data are generated by a specific linear model. The sample ACF values are then compared with these confidence bounds around the theoretical ACF. If the sample ACF is very close to the 95% confidence bounds or outside them, then a different model may provide a much better fit than the current one.
This, however, assumes that we know the underlying model and its actual coefficients. Because this is not the case in practical applications, the usual approach is to compare against a White Noise process, i.e. to check the bounds \(0 \pm 1.96/\sqrt{T}\).
Note that the difference between the two confidence intervals is the hypothesis that we are testing - in the first case we compare the sample ACF with the confidence bounds of a White Noise (i.e. i.i.d.) process, while in the second case, we compare the sample ACF with the confidence bounds of a specific linear time series model - like the ones discussed in sections ??, ?? and ??.
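As an illustration of Bartlett's formula, consider an MA(1) process, for which \(\rho(1) = \theta/(1+\theta^2)\) and \(\rho(h) = 0\) for \(h > 1\). Only the \(k = 1, 2\) terms of the sum for \(w_{11}\) are nonzero, and expanding them gives \(w_{11} = 1 - 3\rho(1)^2 + 4\rho(1)^4\). The sketch below (Python; the variable names and the value of \(\theta\) are our own) verifies this numerically:

```python
theta = 0.5
rho1 = theta / (1 + theta ** 2)   # MA(1): rho(1) = 0.4, rho(h) = 0 for h > 1

def rho(k):
    k = abs(k)
    return 1.0 if k == 0 else (rho1 if k == 1 else 0.0)

# w_11 from Bartlett's formula; the truncation is harmless since
# every term with k > 2 vanishes for an MA(1)
i = 1
w11 = sum((rho(k + i) + rho(k - i) - 2 * rho(i) * rho(k)) ** 2
          for k in range(1, 100))

w11_closed = 1 - 3 * rho1 ** 2 + 4 * rho1 ** 4   # the expanded closed form
```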
2.1.8 Partial Autocorrelation Function
Given a time series \(Y_t\), the partial autocorrelation function (PACF) at lag \(k\), \(\alpha(k)\), is the autocorrelation between \(Y_t\) and \(Y_{t-k}\), with the linear dependence of \(Y_t\) on \(Y_{t-1},..., Y_{t-k+1}\) removed. We begin by specifying the linear regression with lags up to \(k\): \[ \begin{aligned} Y_{t} &= \beta_0 + \beta_1 Y_{t-1} + ... + \beta_{k} Y_{t-k} + \epsilon_t\\ \end{aligned} \] Then the PACF is: \[ \begin{aligned} \alpha(0) &= \mathbb{C}{\rm orr}\left(Y_{t}, Y_t \right) = 1 \\ \alpha(k) &= \beta_k, \quad k \geq 1 \end{aligned} \]
Note that for different PACF lags, we need to re-specify the regression where the relevant lag is the last coefficient. The lags up to \(k\) are only specified in order to remove the linear dependence between \(Y_t\) and \(Y_{t-1},..., Y_{t-k+1}\).
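This regression-based construction can be sketched directly in Python (NumPy assumed; the function name `pacf_ols` and the simulated AR(1) example are our own):

```python
import numpy as np

def pacf_ols(y, k):
    """PACF at lag k: the last coefficient beta_k from an OLS regression
    of Y_t on a constant and Y_{t-1}, ..., Y_{t-k}."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    X = np.column_stack([np.ones(T - k)] +
                        [y[k - j : T - j] for j in range(1, k + 1)])
    beta, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return beta[-1]

# illustration on a simulated AR(1) with coefficient 0.5
rng = np.random.default_rng(0)
T = 5000
eps = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + eps[t]

a1 = pacf_ols(y, 1)   # close to 0.5
a2 = pacf_ols(y, 2)   # close to 0 - the first lag already captures the dynamics
```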
The 95% confidence interval is the same as for the ACF: \[0 \pm \dfrac{1.96}{\sqrt{T}}\]
2.1.9 Testing for ACF significance
We are often interested in whether a series is uncorrelated, i.e. whether all of its autocorrelations are jointly zero. Because of the finite sample size, we can only test a finite number of autocorrelations. We want to test the null hypothesis: \[ \begin{cases} H_0&: \rho(1)=0, \rho(2) = 0,..., \rho(k) = 0 \\\\ H_1&: \exists j \in \{1,...,k\} \text{ such that } \rho(j) \neq 0 \end{cases} \] The test statistics:
- Ljung-Box test statistic: \[Q = T(T+2)\sum_{\tau = 1}^k \dfrac{\hat{\rho}^2(\tau)}{T - \tau}\]
- Box-Pierce test statistic: \[Q = T\sum_{\tau = 1}^k \hat{\rho}^2(\tau)\]
are calculated differently but have the same critical region - if \(Q > \chi_{k, 1-\alpha}^2\), the \((1-\alpha)\) quantile of the \(\chi^2\) distribution with \(k\) degrees of freedom (equivalently, if the p-value is less than the significance level \(\alpha\)), we reject the null hypothesis of uncorrelated variables.
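A minimal sketch of the Ljung-Box test in Python (NumPy and SciPy assumed; the function name `ljung_box` and the simulated example are our own):

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(y, k):
    """Ljung-Box Q statistic and p-value for H0: rho(1) = ... = rho(k) = 0."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    ybar = y.mean()
    gamma0 = np.sum((y - ybar) ** 2) / T
    rho = [np.sum((y[h:] - ybar) * (y[:T - h] - ybar)) / T / gamma0
           for h in range(1, k + 1)]
    Q = T * (T + 2) * sum(r ** 2 / (T - h) for h, r in enumerate(rho, start=1))
    return Q, chi2.sf(Q, df=k)    # p-value from the chi-squared upper tail

rng = np.random.default_rng(1)
Q, pval = ljung_box(rng.standard_normal(500), k=10)  # i.i.d. noise: Q ~ chi2(10)
```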
2.1.10 The Wold Decomposition
The time series \(\{Y_t\}\) is a linear process if it has the following representation: \[ Y_t = \sum_{j = -\infty}^\infty \psi_j \epsilon_{t-j},\ \sum_{j = -\infty}^\infty |\psi_j| < \infty, \ \epsilon_t \sim WN(0, \sigma^2) \]
Where \(WN\) is the White Noise process, which will be presented in detail in subsection ??.
Wold’s Representation Theorem states that any zero-mean covariance-stationary process can be written with a one-sided infinite-order lag operator polynomial applied to a White Noise process:
\[ Y_t = B(L) \epsilon_t = \sum_{j=0}^\infty \beta_j \epsilon_{t-j}, \quad \epsilon_t \sim WN(0, \sigma^2) \] where \(\beta_0 = 1\) and \(\sum_{j = 0}^\infty \beta_j ^2 < \infty\). Furthermore:
- Any process of the above form is stationary;
- If \(\beta_1 = \beta_2 = ... = 0\) - this corresponds to a \(WN\) process. This shows again that \(WN\) is a stationary process.
- If \(\beta_k = \phi^k\), then \(1 + \phi + \phi^2 + ... = 1/(1-\phi) < \infty \iff |\phi| < 1\). Then the process \(Y_t = \epsilon_t + \phi \epsilon_{t-1} + \phi^2 \epsilon_{t-2} + ...\) is a stationary process.
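The \(\beta_k = \phi^k\) example can be simulated by truncating the infinite sum at a large lag \(J\), since \(\phi^J\) becomes negligible. A sketch in Python (NumPy assumed; the truncation lag, seed and \(\phi\) are our own choices), which also checks the variance of the resulting process against the theoretical value \(\sigma^2 \sum_j \phi^{2j} = \sigma^2/(1-\phi^2)\):

```python
import numpy as np

phi = 0.8
rng = np.random.default_rng(7)
eps = rng.standard_normal(10_000)            # WN(0, 1) innovations

# truncate Y_t = sum_{j >= 0} phi^j eps_{t-j} at J lags
J = 100
weights = phi ** np.arange(J + 1)            # beta_j = phi^j, j = 0..J
y = np.convolve(eps, weights)[: len(eps)]

var_theory = 1 / (1 - phi ** 2)              # sigma^2 / (1 - phi^2) with sigma = 1
var_sample = y[J:].var()                     # drop the start-up segment
```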
In Wold’s theorem, we assumed a zero mean, though this is not as restrictive as it may seem. Whenever you see \(Y_t\), analyse the process \(Y_t - \mu\), so that the process is expressed in deviations from its mean. The deviation from the mean has a zero mean by construction. So, there is no loss of generality when analyzing zero-mean processes.
Wold’s representation theorem points to the importance of models with infinite distributed (weighted) lags. At first glance, infinite distributed lag models seem to be of little practical use, since they contain infinitely many parameters. This is not always the case, however - in the example with \(\beta_k = \phi^k\), the infinite polynomial \(B(L)\) has only one parameter, \(\phi\).
Note that sometimes Wold’s Representation Theorem includes an additional deterministic component. Deterministic components are discussed in Chapter 3.