ggplot()
and reshaping data with melt()
and dcast()
The default plotting functions in R
are somewhat barebones in the sense that in order to plot multiple graphs or multiple variables requires additional function calls and additional variable creation to get the desired effect. The complexity depends on the dataset used.
As such, it is recommended to plot panel data with ggplot()
from package ggplot2
. However, ggplot()
requires the dataset to be in a specific format.
In general, data can be categorized into wide format and long format data.
The wide format is the format that is usually encountered while working with spreadsheets:
suppressPackageStartupMessages({library(plm)})
data("SumHes")
head(SumHes)
## year country opec com pop gdp sr
## 1 1960 ALGERIA no no 10800 1723 19.9
## 2 1961 ALGERIA no no 11016 1599 21.1
## 3 1962 ALGERIA no no 11236 1275 15.0
## 4 1963 ALGERIA no no 11460 1517 13.9
## 5 1964 ALGERIA no no 11690 1589 10.6
## 6 1965 ALGERIA no no 11923 1584 11.0
#DT::datatable() works in HTML format only
DT::datatable(data.frame(Country = unique(SumHes$country)))
unique(SumHes$year)
## [1] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
## [15] 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985
Here the dataset can be summarized by two types of information:
indexes (or, identities) - we have an array of countries and an array years.s;
pop
- the country’s population (in thousands);gdp
- real GDP per capita (in 1985 USD)sr
- saving rate, %Note that variables like opec
(OPEC member indicator) and com
(communist regime indicator) can also be used as identities in combination with other index variables.
In general, let’s say that we want to plot the logarithm of population of the first 10 countries. If we wanted to use the default plot()
function, this would require us to subset the correct columns, subset the data by country and use a different color for each country. The code could look something like this:
temp_data = SumHes[, c("country", "year", "pop")]
temp_data$pop = log(temp_data$pop)
colnames(temp_data)[3] <- "log_pop"
cnt = unique(temp_data$country)[1:10]
temp_data <- temp_data[temp_data$country %in% cnt, ]
for(j in 1:length(cnt)){
if(j == 1){
plot(temp_data[temp_data$country == cnt[j], c("year", "log_pop")],
type = "l", col = j, ylim = c(min(temp_data$log_pop), max(temp_data$log_pop)))
}else{
lines(temp_data[temp_data$country == cnt[j], c("year", "log_pop")], col = j)
}
}
But what if we also want to add a legend? And what if we want to add it outside of the plot? Then, we would need to add more code:
# Add extra space to right of plot area; change clipping to figure
par(mar = c(5.1, 4.1, 4.1, 14.1), xpd = TRUE)
#### Previous code:
for(j in 1:length(cnt)){
if(j == 1){
plot(temp_data[temp_data$country == cnt[j], c("year", "log_pop")], main = "log population",
type = "l", col = j, ylim = c(min(temp_data$log_pop), max(temp_data$log_pop)))
}else{
lines(temp_data[temp_data$country == cnt[j], c("year", "log_pop")], col = j)
}
}
#### End of previous code
legend("topright", inset = c(-0.7, 0),
legend = cnt, lty = 1,
col = 1:length(cnt), title = "Country")
Note that we had to use par()
to manually adjust the plotting parameters. by setting the last coordinate in mar
to 14.1
, we reduced the width of the plot to have some space left for the legend. In the legend()
we used inset
to specify the x-axis of where the legend should be placed. Those values depend on the values we set in mar
.
Overall, quite a lot of code was needed for a clear presentation of the data. If we wanted to split the data by an additional variable into two plots, lets say the variable is com
- and indicator whether the country has a communist regime. then we would need to manually review and set the layout for two plots. If there would be more than two regimes, then we would again need to manually specify the layout. Not to mention the fact that we would also need a different number of columns.
For this case, it is best to transform the data into a long
format.
The long format has the following structure:
variable
column, which has the names of all the different variables;value
column, which has the value of the variable for that particular ID.Note that the value
column is limited (at least in R
) to a specific data type - numbers and indicator (integer) values can be in the same column, but not strings or characters (if need be, those strings would need to be converted to integer indicators).
We can use the melt
function from the reshape2
package to convert the data into the long format. Unfortunately, we cannot have different variable types in the same column, so, we will leave opec
and com
in their separate columns as well (this will be useful if we want to further group the data by their regime and OPEC membership status).
suppressPackageStartupMessages({library(reshape2)})
SumHes_long = melt(SumHes, id.vars = c("year", "country", "opec", "com"))
## Warning in melt_dataframe(data, as.integer(id.ind - 1),
## as.integer(measure.ind - : '.Random.seed' is not an integer vector but of
## type 'NULL', so ignored
head(SumHes_long)
## year country opec com variable value
## 1 1960 ALGERIA no no pop 10800
## 2 1961 ALGERIA no no pop 11016
## 3 1962 ALGERIA no no pop 11236
## 4 1963 ALGERIA no no pop 11460
## 5 1964 ALGERIA no no pop 11690
## 6 1965 ALGERIA no no pop 11923
unique(SumHes_long$variable)
## [1] pop gdp sr
## Levels: pop gdp sr
In other words, the data is reshaped in a way that we can filter the relevant variables by their row, rather than column.
Now, we can use ggplot()
and many other functions from ggplot2
package to plot our data in a more intuitive way. To make the process easier to understand we will add more elements to the plot as we more along.
To start with, we specify the data, x and y axis and any other variables, which we can use to group the data:
pop
ulation and we will only use the first 10 countries;suppressWarnings({
suppressPackageStartupMessages({library(ggplot2)})
})
p <- ggplot(SumHes_long[SumHes_long$variable == "pop" &
SumHes_long$country %in% cnt , ],
aes(x = year, y = value, group = country))
print(p)
We see that we have an empty plot. We now need to add the geometric lines. We can specify to color them differently, depending on the country
. Notice that we do not need to create new variables, or use quotes for the column names:
p <- p + geom_line(aes(col = country))
print(p)
The legend is generated automatically. The plots also look better and this was accomplished in less lines of code!
Furthermore, we can directly take logarithms of the values:
tmp_data <- SumHes_long[SumHes_long$variable == "pop" &
SumHes_long$country %in% cnt , ]
p <- ggplot(tmp_data,
aes(x = year, y = log(value), group = country))
p <- p + geom_line(aes(col = country))
print(p)
Now assume further that we want to plot more than one variable. If we try to use the same code as before:
p2 <- ggplot(SumHes_long[SumHes_long$country %in% cnt, ],
aes(x = year, y = value, group = country))
p2 <- p2 + geom_line(aes(col = country))
print(p2)
We see a mishmash of different lines. This is because we did not group our data by these additional characteristics. We can do this two ways - one way is to combine the variables with the countries to create a new identity - this can be done within the plot code:
p2 <- ggplot(SumHes_long[SumHes_long$country %in% cnt, ],
aes(x = year, y = value,
group = interaction(country, variable)))
p2 <- p2 + geom_line(aes(col = country))
p2 <- p2 + geom_point(aes(shape = variable))
print(p2)
Note that we can combine the color and shape and specify them in the ggplot()
itself:
p2 <- ggplot(SumHes_long[SumHes_long$country %in% cnt, ],
aes(x = year, y = value,
color = country,
shape = variable,
group = interaction(country, variable)))
p2 <- p2 + geom_line() + geom_point()
print(p2)
Now we can identify the country and (somewhat) identify the variable.
Another way, which is more relevant to this data example, is to divide a single plot into separate plots (this can be done on any number of attributes). Let’s try to do that with the variables. We also want to have separate axis for each plot, so we specify scales = free_y
:
p2 <- ggplot(SumHes_long[SumHes_long$country %in% cnt, ],
aes(x = year, y = value,
color = country,
shape = variable,
group = interaction(country, variable)))
p2 <- p2 + geom_line() + geom_point()
p2 <- p2 + facet_wrap(~variable, nrow = 2, scales = "free_y")
print(p2)
Now we can clearly identify the variables and examine how they changed in different countries.
We present another example, by selecting a couple of countries with different com
and opec
values. Note that they are factors and need to be transformed into characters to avoid any errors:
SumHes_long$country <- as.character(SumHes_long$country)
SumHes_long$opec <- as.character(SumHes_long$opec)
SumHes_long$com <- as.character(SumHes_long$com)
cnt_2 <- c(unique(SumHes_long$country[SumHes_long$opec == "yes" & SumHes_long$com == "yes"])[1:4],
unique(SumHes_long$country[SumHes_long$opec == "yes" & SumHes_long$com == "no"])[1:4],
unique(SumHes_long$country[SumHes_long$opec == "no" & SumHes_long$com == "yes"])[1:4],
unique(SumHes_long$country[SumHes_long$opec == "no" & SumHes_long$com == "no"])[1:4])
#Remove any NA resutlts, i.e. if no country had both of those values:
cnt_2 <- cnt_2[!is.na(cnt_2)]
#Subset the data for easier code readability
new_data = SumHes_long[SumHes_long$country %in% cnt_2, ]
new_data <- new_data[new_data$variable == "pop", ]
To make the variables easier to understand, we can change them:
new_data$opec[new_data$opec == "yes"] <- "Member of OPEC"
new_data$opec[new_data$opec == "no"] <- "Not a member of OPEC"
new_data$com[new_data$com == "yes"] <- "Communist regime"
new_data$com[new_data$com == "no"] <- "Non-comunist regime"
DT::datatable(head(new_data))
p2 <- ggplot(new_data,
aes(x = year, y = value,
color = country, group = country))
p2 <- p2 + geom_line()
p2 <- p2 + facet_wrap(~ com + opec, nrow = 2, scales = "free_y")
p2 <- p2 + ggtitle("Variable: pop")
p2 <- p2 + ylab("Population (thousands)")
print(p2)
This would have required much more work using the regular plot()
function.
These are only a couple of examples. More can be found:
Let’s say that we have finished working with the data and we want to transform it back from long to wide. Fortunately, there is the dcast()
function in the dplyr
package:
suppressPackageStartupMessages({library(dplyr)})
SumHes_wide <- dcast(SumHes_long,
formula = year + country + opec + com ~ variable)
head(SumHes_wide)
## year country opec com pop gdp sr
## 1 1960 ALGERIA no no 10800 1723 19.9
## 2 1960 ANGOLA no no 4816 931 3.3
## 3 1960 ARGENTINA no no 20618 4462 17.4
## 4 1960 AUSTRALIA no no 10274 7782 30.7
## 5 1960 AUSTRIA no no 7048 5143 24.3
## 6 1960 BANGLADESH no no 52357 952 5.0
We specify all the variables on the left side, the columns which are used to ungroup the data from the variable
column, which we specified on the right side of the formula. By default the value is located in the value
column. If you have a different column, use value.var
to specify its name.
We will analyse panel data on 11 large US manufacturing firms over 20 years, for the years 1935-1954.
suppressPackageStartupMessages({
library(plm)
library(gplots)
library(ggplot2)
})
data("Grunfeld", package = "plm")
print(head(Grunfeld), row.names = FALSE)
## firm year inv value capital
## 1 1935 317.6 3078.5 2.8
## 1 1936 391.8 4661.7 52.6
## 1 1937 410.6 5387.1 156.9
## 1 1938 257.7 2792.2 209.2
## 1 1939 330.8 4313.2 203.4
## 1 1940 461.2 4643.9 207.2
where:
inv
- gross Investment;value
- value of the firm;capital
- stock of plant and equipment.Let’s begin by looking at the data plots:
p <- ggplot(data = Grunfeld)
p <- p + geom_line(aes(y = inv, x = year, col = factor(firm)))
p <- p + facet_wrap(~ firm, scales = "free")
p <- p + theme(legend.position="none")
print(p)
As well as the summary statistics:
aggregate(inv ~ firm, data = Grunfeld, summary)
## firm inv.Min. inv.1st Qu. inv.Median inv.Mean inv.3rd Qu. inv.Max.
## 1 1 257.7000 438.6500 538.3500 608.0200 654.2000 1486.7000
## 2 2 209.9000 338.5250 419.5500 410.4750 470.6250 645.5000
## 3 3 33.1000 60.1750 93.5500 102.2900 146.5250 189.6000
## 4 4 40.2900 57.7800 71.0850 86.1235 92.1850 174.9300
## 5 5 39.6700 51.9225 60.3850 61.8025 71.3150 91.9000
## 6 6 20.3600 27.7625 43.1100 55.4110 70.4550 135.7200
## 7 7 23.2100 33.4775 44.2000 47.5955 54.1350 89.5100
## 8 8 12.9300 31.1725 38.5400 42.8915 53.6900 90.0800
## 9 9 20.8900 30.1825 38.1100 41.8890 54.8625 66.1100
## 10 10 0.9300 1.9575 2.2150 3.0845 4.3250 6.5300
aggregate(inv ~ firm, data = Grunfeld, var)
## firm inv
## 1 1 95836.450105
## 2 2 15725.016711
## 3 3 2360.453579
## 4 4 1825.473056
## 5 5 230.035820
## 6 6 1221.307936
## 7 7 335.464321
## 8 8 365.199308
## 9 9 221.449283
## 10 10 2.953794
we see that the general scale and variance of investments is different throughout the firms.
As such, we may assume that each firm has a specific, unobserved fixed (i.e. constant throughout time) effect.
We can also examine the presence of heterogeneity across countries and years:
par(mfrow = c(2, 1))
suppressWarnings({
plotmeans(inv ~ factor(firm), main = "Heterogeneity across countries", data = Grunfeld)
plotmeans(inv ~ year, main = "Heterogeneity across years", data = Grunfeld)
})
From the first plot we see that the firms (named 1 through 10) have different averages. This is especially true for the first two and the last one firm. This indicates that the firms do differ from one another.
From the second plot we see that the average investment, calculated across all ten firms, is different each year, which indicates that the investment average varies with time, so a time series model is appropriate.
Note that because of different scales coplot()
does not produce an informative plot:
coplot(inv ~ year| factor(firm), type = "b", data = Grunfeld)
We wanted to visualize how inv
and year
vary with firm
, but because of different firm sizes, the scales are not adjusted.. To remedy this, we can take the logs of the data:
coplot(log(inv) ~ year| factor(firm), type = "b", data = Grunfeld, columns = 10)
Note that the first graph shows the range of the observations for the conditioning variable (factor(firm)
). As we read left to right, we see that the investments decrease for the different firms - in this case, as the firm ID increases (i.e. when we are looking at a different firm), the the investment amount decreases. Firm 1
has the largest investment amounts and firm 10
- the lowest.
Another short example on coplots can be found here.
We will estimate the following model on our panel data:
eq <- inv ~ value + capital
we would assume that an increase in firm value
and capital
would lead to an increase in investment
, regardless of country-specific properties.
pooled = plm(eq, data = Grunfeld, index = c("firm", "year"), model = "pooling")
summary(pooled)
## Pooling Model
##
## Call:
## plm(formula = eq, data = Grunfeld, model = "pooling", index = c("firm",
## "year"))
##
## Balanced Panel: n = 10, T = 20, N = 200
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -291.6757 -30.0137 5.3033 34.8293 369.4464
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -42.7143694 9.5116760 -4.4907 1.207e-05 ***
## value 0.1155622 0.0058357 19.8026 < 2.2e-16 ***
## capital 0.2306785 0.0254758 9.0548 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 9359900
## Residual Sum of Squares: 1755900
## R-Squared: 0.81241
## Adj. R-Squared: 0.8105
## F-statistic: 426.576 on 2 and 197 DF, p-value: < 2.22e-16
fixed = plm(eq, data = Grunfeld, index = c("firm", "year"), model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = eq, data = Grunfeld, model = "within", index = c("firm",
## "year"))
##
## Balanced Panel: n = 10, T = 20, N = 200
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -184.00857 -17.64316 0.56337 19.19222 250.70974
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## value 0.110124 0.011857 9.2879 < 2.2e-16 ***
## capital 0.310065 0.017355 17.8666 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2244400
## Residual Sum of Squares: 523480
## R-Squared: 0.76676
## Adj. R-Squared: 0.75311
## F-statistic: 309.014 on 2 and 188 DF, p-value: < 2.22e-16
We want to test whether in our model the unobserved fixed effects are equal to zero, i.e. whether they are equal across all firms (this would lead to a pooled model).
Here, the model under the null hypothesis (i.e. restricted model): \[ y_{it} = x_{it}'\beta + u_{it} \] The model under the alternative hypothesis (i.e. unrestricted model): \[ y_{it} = x_{it}'\beta + c_i + u_{it} \] The null hypothesis is: \[H_0: c_i = 0,\quad \forall i = 1,...,N\] Rejecting this hypothesis means that the fixed effects are non-zero. This can be done via an \(F\) test by comparing fixed effects estimates to the ones from the pooled regression:
\[ F = \dfrac{(\text{SSE}_r - \text{SSE}_u)/\text{df}_1}{\text{SSE}_u / \text{df}_2} \ sim F(\text{df}_1, \text{df}_2) \]
Where:
fixef(fixed)
## 1 2 3 4 5 6
## -70.296717 101.905814 -235.571841 -27.809295 -114.616813 -23.161295
## 7 8 9 10
## -66.553474 -57.545657 -87.222272 -6.567844
pFtest(fixed, pooled)
##
## F test for individual effects
##
## data: eq
## F = 49.177, df1 = 9, df2 = 188, p-value < 2.2e-16
## alternative hypothesis: significant effects
We reject the null hypothesis, so the fixed effects are significant.
random = plm(eq, data = Grunfeld, index = c("firm", "year"), model = "random")
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = eq, data = Grunfeld, model = "random", index = c("firm",
## "year"))
##
## Balanced Panel: n = 10, T = 20, N = 200
##
## Effects:
## var std.dev share
## idiosyncratic 2784.46 52.77 0.282
## individual 7089.80 84.20 0.718
## theta: 0.8612
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -177.6063 -19.7350 4.6851 19.5105 252.8743
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -57.834415 28.898935 -2.0013 0.04674 *
## value 0.109781 0.010493 10.4627 < 2e-16 ***
## capital 0.308113 0.017180 17.9339 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2381400
## Residual Sum of Squares: 548900
## R-Squared: 0.7695
## Adj. R-Squared: 0.76716
## F-statistic: 328.837 on 2 and 197 DF, p-value: < 2.22e-16
Now we will test if we should use RE model (null hypothesis) or FE model (alternative hypothesis) specification via a Hausman test.
phtest(fixed, random)
##
## Hausman Test
##
## data: eq
## chisq = 2.3304, df = 2, p-value = 0.3119
## alternative hypothesis: one model is inconsistent
The alternative is specified as one model is inconsistent
. FE is consistent under both the null (but not as efficient) and the alternative, while RE is inconsistent under the alternative.
p-value = 0.3119 > 0.05
, so we do not reject the null hypothesis that means that both models are consistent. This means that the RE model is more efficient.
We will look at different model fits:
new.dt <- data.frame(Grunfeld)[, c("firm", "year", "inv")]
#Change the name of the last column
colnames(new.dt)[3] <- "Y.Historical"
#Save firm order :
firm_names <- paste0("Firm #", sort(unique(new.dt$firm)))
#Change firm names :
new.dt$firm <- paste0("Firm #", new.dt$firm)
new.dt <- data.frame(new.dt,
Y.POLS = as.numeric(pooled$model[[1]] - pooled$residuals),
Y.FE = as.numeric(fixed$model[[1]] - fixed$residuals),
Y.RE = as.numeric(random$model[[1]] - random$residuals))
new.dt$firm <- factor(new.dt$firm, levels = firm_names)
new.dt <- melt(new.dt, id.vars = c("firm", "year"))
p <- ggplot(data = new.dt,
aes(x = year, y = value, group = interaction(variable, firm)))
p <- p + geom_line(aes(color = variable, linetype = variable))
p <- p + facet_wrap(~ firm, scales = "free")
p <- p + theme(legend.position="bottom")
p <- p + ggtitle("Gross Investment for different firms: Historical, POLS, FE and RE")
print(p)
Note that the fitted()
and predict()
functions either will not work, or return completely different results. As such, in order to calculate the fitted values, we need to calculate Y - residuals(...)
by extracting them from the model.