We say that $X$ Granger causes $Y$ if the past values of $X$ have explanatory power for the current values of $Y$.
To test this, we will generate $X$ and $Y$ in the following way:
$$
\begin{cases}
X_t = 0.2 X_{t-1} + w_{t},\ w_t \sim \mathcal{N}(0,1) \\
Y_t = 0.3 Y_{t-1} - 0.5 Y_{t-2} + 0.2 Y_{t-3} - 0.2 X_{t-1} - 0.3 X_{t-2} - 0.1 X_{t-3} + \epsilon_t,\ \epsilon_t \sim \mathcal{N}(0,1)
\end{cases}
$$
#Import the required modules for data generation
import numpy as np
#Import the required modules for plot creation:
import matplotlib.pyplot as plt
#Import the required modules for DataFrame creation:
import pandas as pd
#Import the main statsmodels API (used later for ACF/PACF plots):
import statsmodels.api as sm
#Import the required modules for test statistic calculation:
import statsmodels.stats as sm_stat
#Import the required modules for model estimation:
import statsmodels.tsa as smt
#Import the required modules for statistical tests:
import scipy.stats
#Import the formula interface for model specification:
import statsmodels.formula.api as smf
We will simulate a sample of size 500:
np.random.seed(1)
N = 500
X = np.zeros(N)
Y = np.zeros(N)
w = np.random.normal(loc = 0.0, scale = 1.0, size = N)
e = np.random.normal(loc = 0.0, scale = 1.0, size = N)
Generate $X$:
X[0] = w[0]
for j in range(1, N):
    X[j] = 0.2 * X[j - 1] + w[j]
Generate $Y$:
Y[0] = e[0]
Y[1] = 0.3 * Y[0] - 0.2 * X[0] + e[1]
Y[2] = 0.3 * Y[1] - 0.5 * Y[0] - 0.2 * X[1] - 0.3 * X[0] + e[2]
for j in range(3, N):
    Y[j] = 0.3 * Y[j - 1] - 0.5 * Y[j - 2] + 0.2 * Y[j - 3] - 0.2 * X[j - 1] - 0.3 * X[j - 2] - 0.1 * X[j - 3] + e[j]
DT = pd.DataFrame([X, Y], index = ["X", "Y"]).T
DT.plot(subplots=True, layout=(2, 1))
plt.show()
We will begin by estimating an unrestricted model:
$$Y_t = \alpha + \delta t + \phi_1 Y_{t-1} + ... + \phi_p Y_{t-p} + \beta_1 X_{t-1} + ... + \beta_q X_{t-q} + \epsilon_t$$
We have already learned the sequential procedure for choosing $p$ and $q$ in the model. In the Granger causality setting, however, the exact lag orders are not of primary importance, so we will assume $p = q$.
#Lag operator:
def L(x, n = 1):
    if n == 0:
        return x
    if isinstance(x, pd.Series):
        return x.shift(n)
    else:
        x = pd.Series(x)
        return x.shift(n)
#Difference operator:
def d(x):
    if isinstance(x, pd.Series):
        return x.diff()
    else:
        x = pd.Series(x)
        return x.diff()
#Trend/time operator:
def time(x):
    return list(range(1, len(x) + 1))
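As a quick sanity check, we can apply the helper functions to the simulated data:
#Lag, difference and trend of the simulated series (first 5 values):
print(L(DT["Y"], 1).head())
print(d(DT["X"]).head())
print(time(DT["Y"])[:5])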
We will also define a function to extract the coefficient table:
#Return the model coefficients of a `smf.ols(...).fit()` object:
def get_coef(mdl_fit):
    result = np.round(pd.DataFrame([mdl_fit.params, mdl_fit.tvalues, mdl_fit.pvalues], index = ["coef", "t-value", "p-value"]).T, 5)
    return result
We will begin by setting $p=q=4$:
mod_UR = smf.ols(formula = 'Y ~ 1 + time(Y) + L(Y, 1) + L(Y, 2) + L(Y, 3) + L(Y, 4) + L(X, 1) + L(X, 2) + L(X, 3) + L(X, 4)', data = DT).fit()
get_coef(mod_UR)
We then remove L(Y, 4), as it is insignificant, and re-check the remaining coefficients. Next, we remove time(Y), since it is also insignificant.
While we know that there is no intercept in our data generating process (i.e. the DGP), it is also worth checking the mean of the series:
print("Series means are:\n", DT.mean(axis = 0))
print("Series variances are:\n", DT.var(axis = 0))
So, it appears that both series have zero means. We can verify this via t-tests, where the null hypotheses are $H_0: \mathbb{E}(Y_t) = 0$ and $H_0: \mathbb{E}(X_t) = 0$:
print(scipy.stats.ttest_1samp(DT["X"], popmean = 0))
print(scipy.stats.ttest_1samp(DT["Y"], popmean = 0))
We do not reject the null hypothesis in either case, since the p-value > 0.05 for both series. Because of this, and because the Intercept coefficient is also insignificant after removing the previous two terms, we remove the Intercept as well:
mod_UR = smf.ols(formula = 'Y ~ -1 + L(Y, 1) + L(Y, 2) + L(Y, 3) + L(X, 1) + L(X, 2) + L(X, 3) + L(X, 4)', data = DT).fit()
get_coef(mod_UR)
Because we want to test whether $X_{t-j},\ j > 0$ improve the model of $Y_t$, we need to estimate a restricted model, where the restriction is that $\beta_1 = ... = \beta_q = 0$:
$$Y_t = \alpha + \delta t + \phi_1 Y_{t-1} + ... + \phi_p Y_{t-p} + \epsilon_t$$
i.e. the coefficients of the lags of $X$ are zero.
mod_R = smf.ols(formula = 'Y ~ -1 + L(Y, 1) + L(Y, 2) + L(Y, 3)', data = DT).fit()
get_coef(mod_R)
Note that we are not attempting to remove or add any coefficients to the Restricted model, as we are already imposing a restriction on $\beta_j$ coefficients, and we want to compare the $UR$ and $R$ models to see if those specific restrictions are valid.
To compare these two models, we look at their SSR (sum of squared residuals, also known as the residual sum of squares (RSS) or the sum of squared errors of prediction (SSE)). We can do this by carrying out an ANOVA test.
sm_stat.api.anova_lm(mod_R, mod_UR)
Since the p-value < 0.05, we reject the null hypothesis $H_0: SSR_R = SSR_{UR}$. This means that the models are not identical. To check which model is better, we can look at the $SSR$ and the $AIC$ or $BIC$ values:
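As a sanity check, the same $F$-statistic reported by the ANOVA table can be computed by hand from the two $SSR$ values and the residual degrees of freedom. A minimal sketch using the fitted objects above:
#Number of restrictions = difference in residual degrees of freedom:
q_restr = mod_R.df_resid - mod_UR.df_resid
#F-statistic for H0: the restricted and unrestricted models fit equally well:
F_stat = ((mod_R.ssr - mod_UR.ssr) / q_restr) / (mod_UR.ssr / mod_UR.df_resid)
#p-value from the F distribution:
p_val = scipy.stats.f.sf(F_stat, q_restr, mod_UR.df_resid)
print("F = ", F_stat, ", p-value = ", p_val)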
pd.DataFrame([mod_R.ssr, mod_UR.ssr], index = ["Restricted", "Unrestricted"], columns = ["SSR"]).T
pd.DataFrame([mod_R.aic, mod_UR.aic], index = ["Restricted", "Unrestricted"], columns = ["AIC"]).T
pd.DataFrame([mod_R.bic, mod_UR.bic], index = ["Restricted", "Unrestricted"], columns = ["BIC"]).T
We see that in all cases the $UR$ model is better: it has a lower SSR (i.e. the model errors are smaller) and lower $AIC$ and $BIC$ values compared to the restricted model.
So, we can conclude that $X$ Granger causes $Y$.
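statsmodels also provides a built-in Granger causality test, which fits the restricted and unrestricted regressions for a range of lag orders and reports several test statistics. A cross-check of our conclusion (note that the function tests whether the series in the second column Granger causes the series in the first column):
from statsmodels.tsa.stattools import grangercausalitytests
#Test whether X Granger causes Y, using up to 3 lags:
granger_test = grangercausalitytests(DT[["Y", "X"]], maxlag = 3)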
Now we are interested in whether this is also the case the other way around. Remembering that we set $p = q$, we are left with the following $UR$ model for $X$:
X_UR = smf.ols(formula = 'X ~ -1 + L(Y, 1) + L(X, 1)', data = DT).fit()
get_coef(X_UR)
The restricted model:
X_R = smf.ols(formula = 'X ~ -1 + L(X, 1)', data = DT).fit()
get_coef(X_R)
Now, we carry out the ANOVA test:
sm_stat.api.anova_lm(X_R, X_UR)
Since the p-value > 0.05, we do not reject the null hypothesis $H_0: SSR_R = SSR_{UR}$.
This means that the errors of the two models are similar in magnitude. In other words, including the additional coefficients from the lags of the $Y$ series does not improve our model for $X$, as the $SSR$ is not significantly different from the model $SSR$ without the lags of $Y$.
Note that the residuals themselves are not identical, i.e. $e_{t, UR} \neq e_{t, R}$; the results say that their overall sums of squared residuals are not statistically significantly different.
pd.DataFrame([X_R.ssr, X_UR.ssr], index = ["Restricted", "Unrestricted"], columns = ["SSR"]).T
pd.DataFrame([X_R.aic, X_UR.aic], index = ["Restricted", "Unrestricted"], columns = ["AIC"]).T
pd.DataFrame([X_R.bic, X_UR.bic], index = ["Restricted", "Unrestricted"], columns = ["BIC"]).T
The $SSR$, $AIC$ and $BIC$ values are quite similar in both the $UR$ and $R$ cases.
We can conclude that $Y$ does not Granger cause $X$.
For the sake of this example, assume that our VAR model is of the form:
$$ \begin{pmatrix}Y_{t} \\ X_{t} \end{pmatrix} = \begin{pmatrix} 3 \\ 0 \end{pmatrix} + \begin{pmatrix} 0.5 & 0.4 \\ -0.2 & 0.6 \end{pmatrix} \begin{pmatrix} Y_{t-1} \\ X_{t-1} \end{pmatrix} + \begin{pmatrix} -0.2 & -0.1 \\ 0.3 & 0.1 \end{pmatrix} \begin{pmatrix} Y_{t-2} \\ X_{t-2} \end{pmatrix} + \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix} ,\quad \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 8 & 6 \\ 6 & 9 \end{bmatrix} \right) $$
i.e., we are analysing a $VAR(2)$ model with correlated residuals.
nsample = 500
alpha = np.array([3, 0])
theta_1 = np.array([[0.5, 0.4], [-0.2, 0.6]])
theta_2 = np.array([[-0.2, -0.1], [0.3, 0.1]])
mu = [0, 0]
sigma = np.array([[8, 6],[6, 9]])
Note that the covariance matrix is always both symmetric and positive semi-definite. Equivalently, its eigenvalues are non-negative. We can verify this:
print("The eigenvalues: ", np.linalg.eigvals(sigma))
print("Are they non-negative? :", np.all(np.linalg.eigvals(sigma) >= 0))
We begin by generating our shocks, $\vec{\epsilon}_t$:
np.random.seed(1)
e = np.random.multivariate_normal(mu, sigma, size = nsample)
#Transform to DataFrame for easier plotting and stat calculation:
e_test = pd.DataFrame(e, columns = ["E1", "E2"])
e_test.plot(subplots=True, layout = (2, 1))
plt.show()
# The sample covariance:
e_test.cov()
# The sample mean:
e_test.mean()
Now, we can generate the data:
Y = np.zeros(shape = (nsample, 2))
#The first 5 rows:
Y[:5, :]
When $t = 1$, we have:
Y[0] = alpha + e[0]
When $t = 2$:
Y[1] = alpha + np.dot(theta_1, Y[0]) + e[1]
When $t > 2$:
for j in range(2, nsample):
    Y[j] = alpha + np.dot(theta_1, Y[j - 1]) + np.dot(theta_2, Y[j - 2]) + e[j]
The resulting series:
Y[:5, :]
Most time series models in Python require the pandas DataFrame object to have a valid date/time index, so we need to assign a date and frequency to the data. Let's say that our data is daily, starting from 2014:
Y = pd.DataFrame(Y, columns = ["Y", "X"])
Y.index = pd.date_range(start = "2014-01-01", periods = len(Y.index), freq = "D").to_period()
Y.index = Y.index.to_timestamp()
Y.head()
Y.plot(subplots=True, layout = (2, 1))
plt.show()
We also want to see whether the series is $TS$ or $DS$. To do this, we can check if: $$\det (I - \Theta_1 z - \Theta_2 z^2 ) = 0 \iff |z_j | > 1,\ \forall j$$ i.e. if all the roots are outside the unit complex circle, then the process is stationary.
Fortunately, we can very easily do this in Python using symbolic expressions:
from sympy import *
z = Symbol('z')
D = Matrix([[1, 0], [0, 1]]) - Matrix(theta_1) * z - Matrix(theta_2) * (z**2)
D
Next, we calculate the determinant:
D.det()
We can also multiply out the expressions in parentheses by converting the result into a polynomial object:
temp_poly = Poly(D.det(), z)
temp_poly
temp_roots = roots(D.det(), z)
temp_roots
The real solutions are: $$ \begin{aligned} z_1 &= 1.23662943597335\\ z_2 &= 18.7362676595998 \end{aligned} $$ The complex solutions are: $$ \begin{aligned} z_3 &= 0.513551452213425 + 2.01301370614087 \cdot i\\ z_4 &= 0.513551452213425 - 2.01301370614087 \cdot i \end{aligned} $$
The additional values returned are the multiplicities of each root.
To calculate the absolute values, we need to extract the roots:
temp_roots = [*temp_roots]
temp_roots
And use the Abs()
function from sympy
. We also use N(expr, 4)
function to round the sympy
results (which are a sympy
object and not an ordinary value) to 4 decimal precision:
for j in temp_roots:
    print("z = ", N(j, 4), "=> |z| = ", N(Abs(j), 4))
So, all the roots are outside the unit circle, which means that this $VAR$ model is stationary.
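An equivalent numerical check, which avoids symbolic computation, is to build the companion matrix of the $VAR(2)$ and verify that all of its eigenvalues lie strictly inside the unit circle (they are the reciprocals of the roots found above). A minimal sketch:
#Companion matrix of the VAR(2) process:
companion = np.vstack([np.hstack([theta_1, theta_2]),
                       np.hstack([np.eye(2), np.zeros((2, 2))])])
#Moduli of its eigenvalues - all should be < 1 for a stationary process:
print(np.abs(np.linalg.eigvals(companion)))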
Note that in general we do not know the underlying DGP, so we would need to perform the $ADF$ or $KPSS$ tests on each variable to verify whether they have a unit root or not.
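For illustration, we can apply both tests to the simulated series. A hedged sketch, assuming a statsmodels version that includes both adfuller and kpss (recall that the $ADF$ null hypothesis is a unit root, while the $KPSS$ null hypothesis is stationarity, so for our stationary data we expect a small ADF p-value and a large KPSS p-value):
from statsmodels.tsa.stattools import adfuller, kpss
for col in Y.columns:
    #ADF: H0 - the series has a unit root; KPSS: H0 - the series is (level) stationary:
    print(col, ": ADF p-value = ", round(adfuller(Y[col])[1], 4),
          ", KPSS p-value = ", round(kpss(Y[col])[1], 4))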
A VAR model can be estimated using the functions documented in statsmodels. It is worth reading the documentation, as it not only provides examples of using the provided functions, but sometimes gives additional guidelines. For example:
The VAR class assumes that the passed time series are stationary.
Non-stationary or trending data can often be transformed to be stationary by first-differencing or some other method.
For direct analysis of non-stationary time series, a standard stable VAR(p) model is not appropriate.
Y_VAR = smt.api.VAR(Y)
Choosing the lag order can be a difficult problem. Fortunately, we can use the automated procedure, which is based on calculating the information criteria for different lag orders:
Note: in case of errors like this one:
2-th leading minor of the array is not positive definite
make sure that the lag order is not too large (or the data sample too small). Also, this can happen if some roots $|z_j|$ are very close to 1 (but still greater than 1).
Y_VAR.select_order(10)
We see that the recommended model is $VAR(2)$ according to $AIC$, $BIC$, $HQIC$ and $FPE$ information criteria.
We will estimate a $VAR(2)$ model:
results = Y_VAR.fit(2)
results.summary()
We can also examine the error covariance matrix:
results.resid.cov()
results.resid.mean()
The estimated model is: $$ \begin{cases} \hat{Y}_t &= 2.694126 + 0.589969 \hat{Y}_{t-1} + 0.370707 \hat{X}_{t-1} -0.255685 \hat{Y}_{t-2} -0.075897 \hat{X}_{t-2} \\ \hat{X}_t &= -0.359460 -0.074297 \hat{Y}_{t-1} + 0.565941 \hat{X}_{t-1} + 0.242305 \hat{Y}_{t-2} + 0.099620 \hat{X}_{t-2} \end{cases} $$
The model residual covariance matrix is:
$$ \hat{\Sigma}_\epsilon = \begin{bmatrix} 7.208196 & 5.598939 \\ 5.598939 & 8.906885 \end{bmatrix} $$
Compared to the true model:
$$\begin{pmatrix} Y_{t} \\ X_{t} \end{pmatrix} = \begin{pmatrix} 3 \\ 0 \end{pmatrix} + \begin{pmatrix} 0.5 & 0.4 \\ -0.2 & 0.6 \end{pmatrix} \begin{pmatrix} Y_{t-1} \\ X_{t-1} \end{pmatrix} + \begin{pmatrix} -0.2 & -0.1 \\ 0.3 & 0.1 \end{pmatrix} \begin{pmatrix} Y_{t-2} \\ X_{t-2} \end{pmatrix} + \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix} ,\quad \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 8 & 6 \\ 6 & 9 \end{bmatrix} \right) $$
We see that, for the most part, the estimated parameters are close to the true parameter values. The same can be said for the error covariance matrix.
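The estimated lag matrices can also be extracted directly from the fitted object and compared with the true $\Theta_1$ and $\Theta_2$. A minimal sketch using the coefs attribute of the results object:
#coefs[0] is the lag-1 coefficient matrix, coefs[1] the lag-2 matrix (variable order: Y, X):
print(np.round(results.coefs, 4))
print("True lag matrices:\n", theta_1, "\n", theta_2)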
From the estimated VAR model, we can plot the impulse response functions of the endogenous variables.
Impulse responses are of interest in econometric studies: they are the estimated responses to a unit impulse in one of the variables. They are computed in practice using the $MA(\infty)$ representation of the $VAR(p$) process: $$ \vec{Y}_t = \vec{\mu} + \sum_{i=0}^\infty \Phi_i \vec{\epsilon}_{t-i}, \quad \vec{\epsilon}_t \sim \mathcal{N}(\vec{0}, \hat{\Sigma}_\epsilon) $$ Orthogonalization is done using the Cholesky decomposition of the estimated error covariance matrix $\hat{\Sigma}_\epsilon$ and hence interpretations may change depending on variable ordering.
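To make the orthogonalization step concrete, the Cholesky factor of $\hat{\Sigma}_\epsilon$ can be computed directly from the fitted model. A small sketch using the sigma_u attribute:
#Lower-triangular P such that Sigma = P P'; the orthogonalized IRFs are based on the shocks P^{-1} e_t:
P = np.linalg.cholesky(results.sigma_u)
print(np.round(P, 4))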
irf = results.irf(15)
We can compare the standard deviation shock to the non-orthogonalized residuals:
irf.plot(orth = False)
To the unit shocks to the orthogonalized residuals:
irf.plot(orth = True)
Note that the main difference is the magnitude of the impulses, in particular the impulse of $Y$ and the corresponding responses of $Y$ and $X$.
The cumulative effects $\Psi_n = \sum_{i=0}^n \Phi_i$ (i.e. where each period is the sum of the current and previous period response values) can be plotted with the long run effects:
irf.plot_cum_effects(orth = False)
results.plot_forecast(10)
plt.xlim(xmin = 425) # restrict the x-axis to show only the last observations and the forecast
We can also extract the forecast and confidence intervals manually:
forecasts = results.forecast_interval(Y.values, 10)
forecasts
#mean forecasts:
forc = forecasts[0]
#lower confidence bound:
l_bound = forecasts[1]
#upper confidence bound:
u_bound = forecasts[2]
#Combine into a single one:
tmp_forc = np.concatenate((forecasts), axis=1)
#Create an empty data frame:
Y_forc = np.empty((len(Y.index), len(Y.columns)*3))
Y_forc[:] = np.nan
Y_forc = pd.DataFrame(np.append(Y_forc, tmp_forc, axis = 0), columns = ["Y_forc", "X_forc", "Y_forc_lower", "X_forc_lower", "Y_forc_upp", "X_forc_upp"])
Y_forc.index = pd.date_range(start = "2014-01-01", periods = len(Y_forc.index), freq = "D").to_period()
Y_forc.index = Y_forc.index.to_timestamp()
Y_forc.tail(11)
#Additionally, include the last historical data for nicer looking plots:
#We take the last row of our historical DataFrame
# and assign those values to the same row (using the date index) of our forecast DataFrame:
Y_forc.loc[Y.tail(1).index, ["Y_forc", "X_forc"]] = Y.loc[Y.tail(1).index].values
Y_forc.tail(11)
#Subplot layout = 2 rows and 1 column - we only need one index
f, axarr = plt.subplots(2, 1) # if we had layout = 2 rows and 2 columns - we would need to use [i, j] indexes to access different subplots
#Size of the figure output: - also works with plt.subplots(2, 1, figsize = (10, 5))
f.set_figheight(5)
f.set_figwidth(10)
#Plot the first series:
axarr[0].plot(Y["Y"], "g-", label = "Y")
axarr[0].plot(Y_forc[["Y_forc"]], "b--", label = "Forecast")
axarr[0].plot(Y_forc[["Y_forc_lower"]], "r:", label = "Forc 0.95 STD err")
axarr[0].plot(Y_forc[["Y_forc_upp"]], "r:")
#Change the range to show the historic values from 2015 January:
axarr[0].set_xlim(xmin = pd.to_datetime("2015-03-01"))
#Add a legend:
axarr[0].legend()
#Plot the second series:
axarr[1].plot(Y["X"], "c-", label = "X")
axarr[1].plot(Y_forc[["X_forc"]], "b--", label = "Forecast")
axarr[1].plot(Y_forc[["X_forc_lower"]], "r:", label = "Forc 0.95 STD err")
axarr[1].plot(Y_forc[["X_forc_upp"]], "r:")
#Change the range to show the historic values from 2015 January:
axarr[1].set_xlim(xmin = pd.to_datetime("2015-03-01"))
#Add a legend:
axarr[1].legend()
plt.tight_layout()
results.resid.plot(subplots = True)
plt.show()
fig = plt.figure(figsize = (10, 5))
results.resid["Y"].plot(ax = fig.add_subplot(211))
plt.title("VAR model residuals of Y")
sm.graphics.tsa.plot_acf(results.resid["Y"], lags = 20, zero = False, ax = fig.add_subplot(223))
sm.graphics.tsa.plot_pacf(results.resid["Y"], lags = 20, zero = False, ax = fig.add_subplot(224))
plt.tight_layout()
plt.show()
fig = plt.figure(figsize = (10, 5))
results.resid["X"].plot(ax = fig.add_subplot(211))
plt.title("VAR model residuals of X")
sm.graphics.tsa.plot_acf(results.resid["X"], lags = 20, zero = False, ax = fig.add_subplot(223))
sm.graphics.tsa.plot_pacf(results.resid["X"], lags = 20, zero = False, ax = fig.add_subplot(224))
plt.tight_layout()
plt.show()
From the $ACF$ and $PACF$ plots we can say that the residuals are white noise. We can conclude that the specified model is acceptable for our data sample.
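We can back this visual impression with a formal test, for example a Ljung-Box test on each residual series. A hedged sketch (depending on the statsmodels version, the function returns arrays or a DataFrame):
from statsmodels.stats.diagnostic import acorr_ljungbox
for col in results.resid.columns:
    #H0: no residual autocorrelation up to lag 10:
    print(col, ":\n", acorr_ljungbox(results.resid[col], lags = [10]))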
In general, Error Correction Models (ECMs) are useful for estimating both the short-term and the long-term effects of either univariate or multivariate time series. The error-correction term relates to the fact that the last period's deviation from the long-run equilibrium, the error, influences the short-run dynamics. So ECMs directly estimate the speed at which a dependent variable returns to equilibrium after a change in other variables.
In the multivariate case, a VECM is just a representation of a cointegrated VAR. This means that we can specify a cointegrated VAR in the form of a VECM and vice versa. The VECM representation also imposes restrictions on the coefficients, which are based on the number of cointegrating relationships. These restrictions lead to more efficient coefficient estimates of the VECM, compared to the VAR representation (which does not impose those restrictions).
We will examine the following VAR(1) model:
$$ \begin{pmatrix} Y_{1,t} \\ Y_{2, t} \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} + \begin{pmatrix} 0.5 & -1 \\ -0.25 & 0.5 \end{pmatrix} \begin{pmatrix} Y_{1, t-1} \\ Y_{2, t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix}, \quad \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} , \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right) $$
nsample = 500
alpha = np.array([1, 2])
theta_1 = np.array([[0.5, -1], [-0.25, 0.5]])
mu = [0, 0]
sigma = np.array([[1, 0],[0, 1]])
We begin by checking the stability (which implies stationarity) of the model:
z = Symbol('z')
D = Matrix([[1, 0], [0, 1]]) - Matrix(theta_1) * z
temp_roots = roots(D.det(), z)
temp_roots = [*temp_roots]
for j in temp_roots:
    print("z = ", N(j, 4), "=> |z| = ", N(Abs(j), 4))
We see that this VAR model has a unit root and is thus not stationary.
Next, we need to estimate the number of cointegrating relations in the process. If there are none, we difference the data and estimate a VAR model on the (stationary) differences. Otherwise, we need to include an error correction term and work with a VECM.
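Before deriving the number of cointegrating relations analytically, note that for two series we could also test for cointegration directly, e.g. with the Engle-Granger test from statsmodels. A hedged sketch, simulating a sample from the $VAR(1)$ above (Y_sim and e_sim are hypothetical helper names used only for this illustration):
from statsmodels.tsa.stattools import coint
np.random.seed(1)
e_sim = np.random.multivariate_normal(mu, sigma, size = nsample)
Y_sim = np.zeros(shape = (nsample, 2))
Y_sim[0] = alpha + e_sim[0]
for j in range(1, nsample):
    Y_sim[j] = alpha + np.dot(theta_1, Y_sim[j - 1]) + e_sim[j]
#Engle-Granger test; H0: no cointegration between the two series:
t_stat, p_val, crit_vals = coint(Y_sim[:, 0], Y_sim[:, 1])
print("Engle-Granger t-statistic = ", round(t_stat, 4), ", p-value = ", round(p_val, 4))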
We subtract $(Y_{1,t-1},\ Y_{2,t-1})^\top$ from both sides of the VAR model to get:
$$ \begin{pmatrix} \Delta Y_{1,t} \\ \Delta Y_{2, t} \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} + \begin{pmatrix} -0.5 & -1 \\ -0.25 & -0.5 \end{pmatrix} \begin{pmatrix} Y_{1, t-1} \\ Y_{2, t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix} $$
where
$$\Pi = \Theta_1 - \mathbf{I} = \begin{pmatrix} 0.5 & -1 \\ -0.25 & 0.5 \end{pmatrix} - \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} -0.5 & -1 \\ -0.25 & -0.5 \end{pmatrix}$$
PI = Matrix(theta_1) - Matrix([[1, 0], [0, 1]])
PI
The rank of $\Pi$ matrix indicates the number of cointegrating relations:
PI.rank()
The VAR model has one cointegrating relation. Then, $\Pi$ can be written as $\Pi_{2 \times 2} = \alpha_{2 \times 1} \beta^\top _{1 \times 2}$.
In general if we have a $d$ -dimensional VAR (so that $\Pi$ is $d \times d$) with $r$ linearly independent cointegrating relations (so that $rank(\Pi) = r$), then: $\Pi_{d\times d} = \alpha_{d \times r} \beta^\top _{r \times d}$. In this case the rows of $\beta$ are cointegration vectors. The matrix $\beta^\top$ is therefore called the cointegrating matrix, and the matrix $\alpha$ is sometimes referred to as the loading matrix.
Note that since cointegration relationships are not unique, $\beta$ is not unique. As a matter of convention, we typically normalize the coefficients of $\beta$ such that one of the elements of $\beta$ is one (when $r = 1$, otherwise we need to normalize in such a way that $\beta = ( \mathbf{I}_{r \times r}, \beta_{(d-r) \times r} )^\top$).
So, a solution for $\Pi = \alpha \beta^\top$ is:
$$ \begin{pmatrix} -0.5 & -1 \\ -0.25 & -0.5 \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} \begin{pmatrix} 1 & \beta_2 \end{pmatrix} $$
Because of the restriction of a normalization on $\beta$, we have: $\alpha_1 = -0.5, \alpha_2 = -0.25 \Rightarrow \beta_2 = 2$. And we get $\alpha = (-0.5, -0.25)^\top$ and $\beta^\top = (1, 2)$.
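We can quickly verify this factorization symbolically, checking that the product $\alpha \beta^\top$ reproduces $\Pi$:
#alpha as a 2 x 1 column and beta^T as a 1 x 2 row:
alpha_vec = Matrix([-0.5, -0.25])
beta_vec = Matrix([[1, 2]])
#Their product should equal the PI matrix above:
alpha_vec * beta_vec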
We can substitute these values in the VECM to get:
$$ \begin{pmatrix} \Delta Y_{1,t} \\ \Delta Y_{2, t} \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} + \begin{pmatrix} -0.5 \\ -0.25 \end{pmatrix} \begin{pmatrix} 1 & 2 \end{pmatrix} \begin{pmatrix} Y_{1, t-1} \\ Y_{2, t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix} $$
and multiplying out $\beta$ gives us:
$$ \begin{pmatrix} \Delta Y_{1,t} \\ \Delta Y_{2, t} \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} + \begin{pmatrix} -0.5 \\ -0.25 \end{pmatrix} \begin{pmatrix} Y_{1, t-1} + 2 Y_{2, t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{pmatrix} $$
Since the differences give us a $VAR(0)$ model, both $\Delta \vec{Y}_t$ and $\vec{\epsilon}_t$ are stationary, so the linear combination $Y_{1, t-1} + 2 Y_{2, t-1}$, which appears in both $\Delta Y_{1,t}$ and $\Delta Y_{2, t}$ equations, is also stationary.
The component $Y_{1, t-1} + 2 Y_{2, t-1}$ is our cointegration relation.
Currently, $VECM$ estimation is not available in the statsmodels package. However, a $VECM$ implementation is being developed on GitHub and will hopefully be included in the next version of the package.