Plotting data with `ggplot()` and reshaping data with `melt()` and `dcast()`

The default plotting functions in R are somewhat barebones in the sense that in order to plot multiple graphs or multiple variables requires additional function calls and additional variable creation to get the desired effect. The complexity depends on the dataset used.

As such, it is recommended to plot panel data with ggplot() from package ggplot2. However, ggplot() requires the dataset to be in a specific format.

In general, data can be categorized into wide format and long format data.

Wide Format Data

The wide format is the format that is usually encountered while working with spreadsheets:

suppressPackageStartupMessages({library(plm)})

data("SumHes")
head(SumHes)

##   year country opec com   pop  gdp   sr
## 1 1960 ALGERIA   no  no 10800 1723 19.9
## 2 1961 ALGERIA   no  no 11016 1599 21.1
## 3 1962 ALGERIA   no  no 11236 1275 15.0
## 4 1963 ALGERIA   no  no 11460 1517 13.9
## 5 1964 ALGERIA   no  no 11690 1589 10.6
## 6 1965 ALGERIA   no  no 11923 1584 11.0

#DT::datatable() works in HTML format only
DT::datatable(data.frame(Country = unique(SumHes$country)))

unique(SumHes$year)

##  [1] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
## [15] 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985

Here the dataset can be summarized by two types of information:

indexes (or, identities) - we have an array of countries and an array years.s;
values - the observed values of different variables - in this case some of the variables are:
- pop - the country’s population (in thousands);
- gdp - real GDP per capita (in 1985 USD)
- sr - saving rate, %
- etc.

Note that variables like opec (OPEC member indicator) and com (communist regime indicator) can also be used as identities in combination with other index variables.

In general, let’s say that we want to plot the logarithm of population of the first 10 countries. If we wanted to use the default plot() function, this would require us to subset the correct columns, subset the data by country and use a different color for each country. The code could look something like this:

temp_data = SumHes[, c("country", "year", "pop")]
temp_data$pop = log(temp_data$pop)

colnames(temp_data)[3] <- "log_pop"

cnt = unique(temp_data$country)[1:10]
temp_data <- temp_data[temp_data$country %in% cnt, ]


for(j in 1:length(cnt)){
  if(j == 1){
    plot(temp_data[temp_data$country == cnt[j], c("year", "log_pop")], 
         type = "l", col = j, ylim = c(min(temp_data$log_pop), max(temp_data$log_pop)))
  }else{
    lines(temp_data[temp_data$country == cnt[j], c("year", "log_pop")], col = j)
  }
}

But what if we also want to add a legend? And what if we want to add it outside of the plot? Then, we would need to add more code:

# Add extra space to right of plot area; change clipping to figure
par(mar = c(5.1, 4.1, 4.1, 14.1), xpd = TRUE)

#### Previous code:
for(j in 1:length(cnt)){
  if(j == 1){
    plot(temp_data[temp_data$country == cnt[j], c("year", "log_pop")], main = "log population",
         type = "l", col = j, ylim = c(min(temp_data$log_pop), max(temp_data$log_pop)))
  }else{
    lines(temp_data[temp_data$country == cnt[j], c("year", "log_pop")], col = j)
  }
}
#### End of previous code

legend("topright", inset = c(-0.7, 0), 
       legend = cnt, lty = 1, 
       col = 1:length(cnt), title = "Country")

Note that we had to use par() to manually adjust the plotting parameters. by setting the last coordinate in mar to 14.1, we reduced the width of the plot to have some space left for the legend. In the legend() we used inset to specify the x-axis of where the legend should be placed. Those values depend on the values we set in mar.

Overall, quite a lot of code was needed for a clear presentation of the data. If we wanted to split the data by an additional variable into two plots, lets say the variable is com - and indicator whether the country has a communist regime. then we would need to manually review and set the layout for two plots. If there would be more than two regimes, then we would again need to manually specify the layout. Not to mention the fact that we would also need a different number of columns.

For this case, it is best to transform the data into a long format.

Long Format Data

The long format has the following structure:

One or more ID columns (i.e. the indexes; any other variables we may want to use for data grouping)
The variable column, which has the names of all the different variables;
The value column, which has the value of the variable for that particular ID.

Note that the value column is limited (at least in R) to a specific data type - numbers and indicator (integer) values can be in the same column, but not strings or characters (if need be, those strings would need to be converted to integer indicators).

Converting from Wide to Long Format

We can use the melt function from the reshape2 package to convert the data into the long format. Unfortunately, we cannot have different variable types in the same column, so, we will leave opec and com in their separate columns as well (this will be useful if we want to further group the data by their regime and OPEC membership status).

suppressPackageStartupMessages({library(reshape2)})

SumHes_long = melt(SumHes, id.vars = c("year", "country", "opec", "com"))

## Warning in melt_dataframe(data, as.integer(id.ind - 1),
## as.integer(measure.ind - : '.Random.seed' is not an integer vector but of
## type 'NULL', so ignored

head(SumHes_long)

##   year country opec com variable value
## 1 1960 ALGERIA   no  no      pop 10800
## 2 1961 ALGERIA   no  no      pop 11016
## 3 1962 ALGERIA   no  no      pop 11236
## 4 1963 ALGERIA   no  no      pop 11460
## 5 1964 ALGERIA   no  no      pop 11690
## 6 1965 ALGERIA   no  no      pop 11923

unique(SumHes_long$variable)

## [1] pop gdp sr 
## Levels: pop gdp sr

In other words, the data is reshaped in a way that we can filter the relevant variables by their row, rather than column.

Now, we can use ggplot() and many other functions from ggplot2 package to plot our data in a more intuitive way. To make the process easier to understand we will add more elements to the plot as we more along.

To start with, we specify the data, x and y axis and any other variables, which we can use to group the data:

for this first example, we will subset the data to only have the population and we will only use the first 10 countries;
x-axis will be the years;
y-axis will have the values;
group - we indicate the variable, which will be used to separate (group) the data points

suppressWarnings({
  suppressPackageStartupMessages({library(ggplot2)})
})

p <- ggplot(SumHes_long[SumHes_long$variable == "pop" &
                          SumHes_long$country %in% cnt , ], 
            aes(x = year, y = value, group = country))
print(p)

We see that we have an empty plot. We now need to add the geometric lines. We can specify to color them differently, depending on the country. Notice that we do not need to create new variables, or use quotes for the column names:

p <- p + geom_line(aes(col = country))
print(p)

The legend is generated automatically. The plots also look better and this was accomplished in less lines of code!

Furthermore, we can directly take logarithms of the values:

tmp_data <- SumHes_long[SumHes_long$variable == "pop" &
                          SumHes_long$country %in% cnt , ]

p <- ggplot(tmp_data,
            aes(x = year, y = log(value), group = country))
p <- p + geom_line(aes(col = country))
print(p)

Now assume further that we want to plot more than one variable. If we try to use the same code as before:

p2 <- ggplot(SumHes_long[SumHes_long$country %in% cnt, ], 
            aes(x = year, y = value, group = country))
p2 <- p2 + geom_line(aes(col = country))
print(p2)

We see a mishmash of different lines. This is because we did not group our data by these additional characteristics. We can do this two ways - one way is to combine the variables with the countries to create a new identity - this can be done within the plot code:

p2 <- ggplot(SumHes_long[SumHes_long$country %in% cnt, ], 
            aes(x = year, y = value, 
                group = interaction(country, variable)))
p2 <- p2 + geom_line(aes(col = country))
p2 <- p2 + geom_point(aes(shape = variable))
print(p2)

Note that we can combine the color and shape and specify them in the ggplot() itself:

p2 <- ggplot(SumHes_long[SumHes_long$country %in% cnt, ], 
            aes(x = year, y = value, 
                color = country,
                shape = variable,
                group = interaction(country, variable)))
p2 <- p2 + geom_line() + geom_point()
print(p2)

Now we can identify the country and (somewhat) identify the variable.

Another way, which is more relevant to this data example, is to divide a single plot into separate plots (this can be done on any number of attributes). Let’s try to do that with the variables. We also want to have separate axis for each plot, so we specify scales = free_y:

p2 <- ggplot(SumHes_long[SumHes_long$country %in% cnt, ], 
            aes(x = year, y = value, 
                color = country,
                shape = variable,
                group = interaction(country, variable)))
p2 <- p2 + geom_line() + geom_point()
p2 <- p2 + facet_wrap(~variable, nrow = 2, scales = "free_y")
print(p2)

Now we can clearly identify the variables and examine how they changed in different countries.

We present another example, by selecting a couple of countries with different com and opec values. Note that they are factors and need to be transformed into characters to avoid any errors:

SumHes_long$country <- as.character(SumHes_long$country)
SumHes_long$opec <- as.character(SumHes_long$opec)
SumHes_long$com  <- as.character(SumHes_long$com)

cnt_2 <- c(unique(SumHes_long$country[SumHes_long$opec == "yes" & SumHes_long$com == "yes"])[1:4],
           unique(SumHes_long$country[SumHes_long$opec == "yes" & SumHes_long$com == "no"])[1:4],
           unique(SumHes_long$country[SumHes_long$opec == "no" & SumHes_long$com == "yes"])[1:4],
           unique(SumHes_long$country[SumHes_long$opec == "no" & SumHes_long$com == "no"])[1:4])
#Remove any NA resutlts, i.e. if no country had both of those values:
cnt_2 <- cnt_2[!is.na(cnt_2)]
#Subset the data for easier code readability
new_data = SumHes_long[SumHes_long$country %in% cnt_2, ]
new_data <- new_data[new_data$variable == "pop", ]

To make the variables easier to understand, we can change them:

new_data$opec[new_data$opec == "yes"] <- "Member of OPEC"
new_data$opec[new_data$opec == "no"] <- "Not a member of OPEC"

new_data$com[new_data$com == "yes"] <- "Communist regime"
new_data$com[new_data$com == "no"] <- "Non-comunist regime"

DT::datatable(head(new_data))

p2 <- ggplot(new_data, 
            aes(x = year, y = value,
                color = country,  group = country))
p2 <- p2 + geom_line()
p2 <- p2 + facet_wrap(~ com + opec, nrow = 2, scales = "free_y")
p2 <- p2 + ggtitle("Variable: pop")
p2 <- p2 + ylab("Population (thousands)")
print(p2)

This would have required much more work using the regular plot() function.

These are only a couple of examples. More can be found:

Converting from Long to Wide Format

Let’s say that we have finished working with the data and we want to transform it back from long to wide. Fortunately, there is the dcast() function in the dplyr package:

suppressPackageStartupMessages({library(dplyr)})

SumHes_wide <- dcast(SumHes_long, 
                     formula = year + country + opec + com ~ variable)
head(SumHes_wide)

##   year    country opec com   pop  gdp   sr
## 1 1960    ALGERIA   no  no 10800 1723 19.9
## 2 1960     ANGOLA   no  no  4816  931  3.3
## 3 1960  ARGENTINA   no  no 20618 4462 17.4
## 4 1960  AUSTRALIA   no  no 10274 7782 30.7
## 5 1960    AUSTRIA   no  no  7048 5143 24.3
## 6 1960 BANGLADESH   no  no 52357  952  5.0

We specify all the variables on the left side, the columns which are used to ungroup the data from the variable column, which we specified on the right side of the formula. By default the value is located in the value column. If you have a different column, use value.var to specify its name.

Panel Data Model Example: US Manufacturing Data

We will analyse panel data on 11 large US manufacturing firms over 20 years, for the years 1935-1954.

suppressPackageStartupMessages({
  library(plm)
  library(gplots)
  library(ggplot2)
})
data("Grunfeld", package = "plm")
print(head(Grunfeld), row.names = FALSE)

##  firm year   inv  value capital
##     1 1935 317.6 3078.5     2.8
##     1 1936 391.8 4661.7    52.6
##     1 1937 410.6 5387.1   156.9
##     1 1938 257.7 2792.2   209.2
##     1 1939 330.8 4313.2   203.4
##     1 1940 461.2 4643.9   207.2

where:

inv - gross Investment;
value - value of the firm;
capital - stock of plant and equipment.

Initial data analysis

Let’s begin by looking at the data plots:

p <- ggplot(data = Grunfeld)
p <- p + geom_line(aes(y = inv, x = year, col = factor(firm)))
p <- p + facet_wrap(~ firm, scales = "free")
p <- p + theme(legend.position="none")
print(p)

As well as the summary statistics:

aggregate(inv ~ firm, data = Grunfeld, summary)

##    firm  inv.Min. inv.1st Qu. inv.Median  inv.Mean inv.3rd Qu.  inv.Max.
## 1     1  257.7000    438.6500   538.3500  608.0200    654.2000 1486.7000
## 2     2  209.9000    338.5250   419.5500  410.4750    470.6250  645.5000
## 3     3   33.1000     60.1750    93.5500  102.2900    146.5250  189.6000
## 4     4   40.2900     57.7800    71.0850   86.1235     92.1850  174.9300
## 5     5   39.6700     51.9225    60.3850   61.8025     71.3150   91.9000
## 6     6   20.3600     27.7625    43.1100   55.4110     70.4550  135.7200
## 7     7   23.2100     33.4775    44.2000   47.5955     54.1350   89.5100
## 8     8   12.9300     31.1725    38.5400   42.8915     53.6900   90.0800
## 9     9   20.8900     30.1825    38.1100   41.8890     54.8625   66.1100
## 10   10    0.9300      1.9575     2.2150    3.0845      4.3250    6.5300

aggregate(inv ~ firm, data = Grunfeld, var)

##    firm          inv
## 1     1 95836.450105
## 2     2 15725.016711
## 3     3  2360.453579
## 4     4  1825.473056
## 5     5   230.035820
## 6     6  1221.307936
## 7     7   335.464321
## 8     8   365.199308
## 9     9   221.449283
## 10   10     2.953794

we see that the general scale and variance of investments is different throughout the firms.

As such, we may assume that each firm has a specific, unobserved fixed (i.e. constant throughout time) effect.

We can also examine the presence of heterogeneity across countries and years:

par(mfrow = c(2, 1))
suppressWarnings({
plotmeans(inv ~ factor(firm), main = "Heterogeneity across countries", data = Grunfeld)
plotmeans(inv ~ year, main = "Heterogeneity across years", data = Grunfeld)
})

From the first plot we see that the firms (named 1 through 10) have different averages. This is especially true for the first two and the last one firm. This indicates that the firms do differ from one another.

From the second plot we see that the average investment, calculated across all ten firms, is different each year, which indicates that the investment average varies with time, so a time series model is appropriate.

Note that because of different scales coplot() does not produce an informative plot:

coplot(inv ~ year| factor(firm), type = "b", data = Grunfeld)

We wanted to visualize how inv and year vary with firm, but because of different firm sizes, the scales are not adjusted.. To remedy this, we can take the logs of the data:

coplot(log(inv) ~ year| factor(firm), type = "b", data = Grunfeld, columns = 10)

Note that the first graph shows the range of the observations for the conditioning variable (factor(firm)). As we read left to right, we see that the investments decrease for the different firms - in this case, as the firm ID increases (i.e. when we are looking at a different firm), the the investment amount decreases. Firm 1 has the largest investment amounts and firm 10 - the lowest.

Another short example on coplots can be found here.

Models and their comparisons

We will estimate the following model on our panel data:

eq <- inv ~ value + capital

we would assume that an increase in firm value and capital would lead to an increase in investment, regardless of country-specific properties.

Pooled OLS:

pooled = plm(eq, data = Grunfeld, index = c("firm", "year"), model = "pooling")
summary(pooled)

## Pooling Model
## 
## Call:
## plm(formula = eq, data = Grunfeld, model = "pooling", index = c("firm", 
##     "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -291.6757  -30.0137    5.3033   34.8293  369.4464 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept) -42.7143694   9.5116760 -4.4907 1.207e-05 ***
## value         0.1155622   0.0058357 19.8026 < 2.2e-16 ***
## capital       0.2306785   0.0254758  9.0548 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    9359900
## Residual Sum of Squares: 1755900
## R-Squared:      0.81241
## Adj. R-Squared: 0.8105
## F-statistic: 426.576 on 2 and 197 DF, p-value: < 2.22e-16

Fixed Effects:

fixed = plm(eq, data = Grunfeld, index = c("firm", "year"), model = "within")
summary(fixed)

## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = eq, data = Grunfeld, model = "within", index = c("firm", 
##     "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -184.00857  -17.64316    0.56337   19.19222  250.70974 
## 
## Coefficients:
##         Estimate Std. Error t-value  Pr(>|t|)    
## value   0.110124   0.011857  9.2879 < 2.2e-16 ***
## capital 0.310065   0.017355 17.8666 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    2244400
## Residual Sum of Squares: 523480
## R-Squared:      0.76676
## Adj. R-Squared: 0.75311
## F-statistic: 309.014 on 2 and 188 DF, p-value: < 2.22e-16

Testing for Individual Effects

We want to test whether in our model the unobserved fixed effects are equal to zero, i.e. whether they are equal across all firms (this would lead to a pooled model).

Here, the model under the null hypothesis (i.e. restricted model): \[ y_{it} = x_{it}'\beta + u_{it} \] The model under the alternative hypothesis (i.e. unrestricted model): \[ y_{it} = x_{it}'\beta + c_i + u_{it} \] The null hypothesis is: \[H_0: c_i = 0,\quad \forall i = 1,...,N\] Rejecting this hypothesis means that the fixed effects are non-zero. This can be done via an \(F\) test by comparing fixed effects estimates to the ones from the pooled regression:

\[ F = \dfrac{(\text{SSE}_r - \text{SSE}_u)/\text{df}_1}{\text{SSE}_u / \text{df}_2} \ sim F(\text{df}_1, \text{df}_2) \]

Where:

\(\text{SSE}_r\) - residual sum of squares from the restricted (i.e. pooled) model;
\(\text{SSE}_u\) the residual sum of squares from the unrestricted (i.e. fixed-effects) model;
\(\text{df}_1\), \(\text{df}_2\) - degrees of freedom of the models.

fixef(fixed)

##           1           2           3           4           5           6 
##  -70.296717  101.905814 -235.571841  -27.809295 -114.616813  -23.161295 
##           7           8           9          10 
##  -66.553474  -57.545657  -87.222272   -6.567844

pFtest(fixed, pooled)

## 
##  F test for individual effects
## 
## data:  eq
## F = 49.177, df1 = 9, df2 = 188, p-value < 2.2e-16
## alternative hypothesis: significant effects

We reject the null hypothesis, so the fixed effects are significant.

Random Effects:

random = plm(eq, data = Grunfeld, index = c("firm", "year"), model = "random")
summary(random)

## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = eq, data = Grunfeld, model = "random", index = c("firm", 
##     "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Effects:
##                   var std.dev share
## idiosyncratic 2784.46   52.77 0.282
## individual    7089.80   84.20 0.718
## theta: 0.8612
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -177.6063  -19.7350    4.6851   19.5105  252.8743 
## 
## Coefficients:
##               Estimate Std. Error t-value Pr(>|t|)    
## (Intercept) -57.834415  28.898935 -2.0013  0.04674 *  
## value         0.109781   0.010493 10.4627  < 2e-16 ***
## capital       0.308113   0.017180 17.9339  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    2381400
## Residual Sum of Squares: 548900
## R-Squared:      0.7695
## Adj. R-Squared: 0.76716
## F-statistic: 328.837 on 2 and 197 DF, p-value: < 2.22e-16

Deciding between FE or RE

Now we will test if we should use RE model (null hypothesis) or FE model (alternative hypothesis) specification via a Hausman test.

phtest(fixed, random)

## 
##  Hausman Test
## 
## data:  eq
## chisq = 2.3304, df = 2, p-value = 0.3119
## alternative hypothesis: one model is inconsistent

The alternative is specified as one model is inconsistent. FE is consistent under both the null (but not as efficient) and the alternative, while RE is inconsistent under the alternative.

p-value = 0.3119 > 0.05, so we do not reject the null hypothesis that means that both models are consistent. This means that the RE model is more efficient.

Model Fit

We will look at different model fits:

new.dt <- data.frame(Grunfeld)[, c("firm", "year", "inv")]
#Change the name of the last column
colnames(new.dt)[3] <- "Y.Historical"
#Save firm order :
firm_names  <- paste0("Firm #", sort(unique(new.dt$firm)))
#Change firm names :
new.dt$firm <- paste0("Firm #", new.dt$firm)


new.dt <- data.frame(new.dt, 
                     Y.POLS = as.numeric(pooled$model[[1]] - pooled$residuals),
                     Y.FE = as.numeric(fixed$model[[1]] - fixed$residuals),
                     Y.RE = as.numeric(random$model[[1]] - random$residuals))
new.dt$firm <- factor(new.dt$firm, levels = firm_names)

new.dt <- melt(new.dt, id.vars = c("firm", "year"))

p <- ggplot(data = new.dt,
            aes(x = year, y = value, group = interaction(variable, firm)))

p <- p + geom_line(aes(color = variable, linetype = variable))

p <- p + facet_wrap(~ firm, scales = "free")
p <- p + theme(legend.position="bottom")
p <- p + ggtitle("Gross Investment for different firms: Historical, POLS, FE and RE")
print(p)

Note that the fitted() and predict() functions either will not work, or return completely different results. As such, in order to calculate the fitted values, we need to calculate Y - residuals(...) by extracting them from the model.

Task 10

Plotting data with ggplot() and reshaping data with melt() and dcast()