1.2 Statistical Data Types

This section gives a brief introduction to three of the most common data types used in data analysis (specifically, statistical/econometric analysis and modeling).

1.2.1 Cross-sectional Data

Cross-sectional data is collected in a single time period and describes individual units - people, companies, countries, etc. Some examples include:

  • Student grades at the end of the current semester;
  • Household data of the previous year - expenditure on food, unemployment, income, etc.;
  • Car data - average speed, horsepower, color, etc.

With cross-sectional data the ordering of the data does not matter. In other words, we can sort the data in ascending, descending or even random order and this will not affect our modeling results.

The following data sample gives the speed of cars and the distances taken to stop. The data were recorded in the 1920s.

# R: load the built-in cars dataset and show its first six rows
data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
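# Python: fetch the same dataset from the Rdatasets collection (needs internet access)
import statsmodels.api as sm

cars = sm.datasets.get_rdataset("cars", "datasets")
# head() shows the first five rows by default
print(cars.data.head())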
##    speed  dist
## 0      4     2
## 1      4    10
## 2      7     4
## 3      7    22
## 4      8    16
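
The claim that row order does not matter for cross-sectional data can be checked directly. The following is a minimal sketch, not part of the original analysis: it takes the first six (speed, dist) rows printed above, fits a least-squares line to them, shuffles the rows and fits again. The helper name `ols_fit` and the random seed are illustrative assumptions, and six rows are far too few for a meaningful model of stopping distance.

```python
import random

# First six (speed, dist) rows of the cars dataset, as printed above.
rows = [(4, 2), (4, 10), (7, 4), (7, 22), (8, 16), (9, 10)]


def ols_fit(pairs):
    """Return (intercept, slope) of the least-squares line dist = a + b * speed.

    Hypothetical helper written for this illustration only.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Put the same rows in a random order (the seed is an arbitrary choice).
random.seed(42)
shuffled = random.sample(rows, k=len(rows))

original_fit = ols_fit(rows)
shuffled_fit = ols_fit(shuffled)

print("original order:", original_fit)
print("shuffled order:", shuffled_fit)
```

Both fits give the same intercept and slope (up to floating-point rounding), because the least-squares formulas only involve sums over the rows, and a sum does not depend on the order of its terms.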

1.2.2 Time Series Data

Data collected at a number of specific points in time is called time series data. Examples include stock prices, interest rates, exchange rates as well as product prices, GDP, etc. Time series data can be observed at many different frequencies (hourly, daily, weekly, monthly, quarterly, annually, etc.).

Unlike cross-sectional data, the ordering of the data is important in time series data. Each observation represents the value at a specific point in time. As such, time series data are typically presented in chronological order. Changing the order of the data discards its time dimension.
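
To illustrate why order matters here, the sketch below compares the lag-1 autocorrelation (the correlation between each value and the one before it) of a series in chronological order and after shuffling. The series is synthetic, a simple upward trend invented for this illustration, and the helper `lag1_autocorr` and the seed are likewise assumptions rather than anything taken from the datasets in this section.

```python
import random


def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a sequence of numbers.

    Hypothetical helper written for this illustration only.
    """
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((value - mean) ** 2 for value in series)
    return num / den


# Synthetic data (an assumption): 100 periods of a steady upward trend.
series = [0.5 * t for t in range(100)]

# The same values with the time ordering destroyed (arbitrary seed).
random.seed(0)
shuffled = random.sample(series, k=len(series))

ordered_acf = lag1_autocorr(series)
shuffled_acf = lag1_autocorr(shuffled)

print("chronological order:", round(ordered_acf, 3))
print("shuffled order:     ", round(shuffled_acf, 3))
```

In chronological order, neighbouring values are almost perfectly correlated; after shuffling, the same numbers show little dependence between neighbours. Any method that relies on lags, differences or trends would therefore give different results on the shuffled series.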

The following data sample consists of quarterly observations of money, GDP and the interest rate in Canada, where \(m\) is the log of the real money supply; \(y\) is the log of GDP in 1992 dollars, seasonally adjusted; \(p\) is the log of the price level and \(r\) is the 3-month treasury bill rate.

# R: load the Money dataset from the Ecdat package and show its first six rows
data(Money, package = "Ecdat")
head(data.frame(Money), 6)
##          m        y        p       r
## 1 11.21111 12.62052 -1.49969 4.46333
## 2 11.21075 12.64173 -1.48955 4.17333
## 3 11.20382 12.64643 -1.48414 4.47333
## 4 11.17621 12.65076 -1.47146 5.45333
## 5 11.14330 12.65842 -1.45747 6.69000
## 6 11.11438 12.68715 -1.45569 6.83333
# Python: fetch the same dataset (uses the statsmodels import from the earlier example)
money = sm.datasets.get_rdataset("Money", "Ecdat")
print(money.data.head(6))
##           m         y        p        r
## 0  11.21111  12.62052 -1.49969  4.46333
## 1  11.21075  12.64173 -1.48955  4.17333
## 2  11.20382  12.64643 -1.48414  4.47333
## 3  11.17621  12.65076 -1.47146  5.45333
## 4  11.14330  12.65842 -1.45747  6.69000
## 5  11.11438  12.68715 -1.45569  6.83333

1.2.3 Panel (or Longitudinal) Data

Panel data combines cross-sectional and time series data: the same individuals (persons, firms, cities, etc.) are observed at several points in time (days, years, before and after treatment, etc.). Panel data allows you to control for variables that you cannot observe or measure, such as:

  • cultural factors (for example, country- or region-specific ones);
  • differences in business practices across companies.

If we have the same number of time period observations for each individual, then we have a balanced panel; otherwise, the panel is unbalanced.
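
The balance condition can be checked mechanically by counting the distinct time periods observed for each individual. The following is a minimal sketch under stated assumptions: the records are (firm, year) pairs in the style of a firm-level panel, limited to two firms and three years for brevity, and `is_balanced` is a hypothetical helper, not a function from any library.

```python
from collections import defaultdict


def is_balanced(records):
    """Return True if every unit has the same number of distinct periods.

    `records` is an iterable of (unit, period) pairs. Hypothetical helper
    written for this illustration only.
    """
    periods_by_unit = defaultdict(set)
    for unit, period in records:
        periods_by_unit[unit].add(period)
    counts = {len(periods) for periods in periods_by_unit.values()}
    return len(counts) == 1


# Illustrative (firm, year) pairs: two firms, three years each.
panel = [
    (1, 1935), (1, 1936), (1, 1937),
    (2, 1935), (2, 1936), (2, 1937),
]

# Dropping one firm-year makes the per-firm counts differ.
unbalanced = [record for record in panel if record != (2, 1936)]

print(is_balanced(panel))       # True: both firms have three years
print(is_balanced(unbalanced))  # False: firm 2 now has only two years
```

This follows the definition used in this section, which compares only the number of periods per individual. A stricter check would also require every individual to be observed in the same periods.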

The following data sample is taken from the Grunfeld Investment Data - a panel of 10 firms in the US observed yearly from 1935 to 1954, where firm is the firm ID, year is the date, inv is the gross investment, value is the value of the firm and capital is the stock of plant and equipment. The first 40 rows cover the first two firms.

# R: load the Grunfeld dataset from the plm package and show the first 40 rows (firms 1 and 2)
data(Grunfeld, package = "plm")
head(data.frame(Grunfeld), 40)
##    firm year    inv  value capital
## 1     1 1935  317.6 3078.5     2.8
## 2     1 1936  391.8 4661.7    52.6
## 3     1 1937  410.6 5387.1   156.9
## 4     1 1938  257.7 2792.2   209.2
## 5     1 1939  330.8 4313.2   203.4
## 6     1 1940  461.2 4643.9   207.2
## 7     1 1941  512.0 4551.2   255.2
## 8     1 1942  448.0 3244.1   303.7
## 9     1 1943  499.6 4053.7   264.1
## 10    1 1944  547.5 4379.3   201.6
## 11    1 1945  561.2 4840.9   265.0
## 12    1 1946  688.1 4900.9   402.2
## 13    1 1947  568.9 3526.5   761.5
## 14    1 1948  529.2 3254.7   922.4
## 15    1 1949  555.1 3700.2  1020.1
## 16    1 1950  642.9 3755.6  1099.0
## 17    1 1951  755.9 4833.0  1207.7
## 18    1 1952  891.2 4924.9  1430.5
## 19    1 1953 1304.4 6241.7  1777.3
## 20    1 1954 1486.7 5593.6  2226.3
## 21    2 1935  209.9 1362.4    53.8
## 22    2 1936  355.3 1807.1    50.5
## 23    2 1937  469.9 2676.3   118.1
## 24    2 1938  262.3 1801.9   260.2
## 25    2 1939  230.4 1957.3   312.7
## 26    2 1940  361.6 2202.9   254.2
## 27    2 1941  472.8 2380.5   261.4
## 28    2 1942  445.6 2168.6   298.7
## 29    2 1943  361.6 1985.1   301.8
## 30    2 1944  288.2 1813.9   279.1
## 31    2 1945  258.7 1850.2   213.8
## 32    2 1946  420.3 2067.7   132.6
## 33    2 1947  420.5 1796.7   264.8
## 34    2 1948  494.5 1625.8   306.9
## 35    2 1949  405.1 1667.0   351.1
## 36    2 1950  418.8 1677.4   357.8
## 37    2 1951  588.2 2289.5   342.1
## 38    2 1952  645.5 2159.4   444.2
## 39    2 1953  641.0 2031.3   623.6
## 40    2 1954  459.3 2115.5   669.7
# Python: fetch the same dataset (uses the statsmodels import from the earlier example)
grunfeld = sm.datasets.get_rdataset("Grunfeld", "plm")
print(grunfeld.data.head(40))
##     firm  year     inv   value  capital
## 0      1  1935   317.6  3078.5      2.8
## 1      1  1936   391.8  4661.7     52.6
## 2      1  1937   410.6  5387.1    156.9
## 3      1  1938   257.7  2792.2    209.2
## 4      1  1939   330.8  4313.2    203.4
## 5      1  1940   461.2  4643.9    207.2
## 6      1  1941   512.0  4551.2    255.2
## 7      1  1942   448.0  3244.1    303.7
## 8      1  1943   499.6  4053.7    264.1
## 9      1  1944   547.5  4379.3    201.6
## 10     1  1945   561.2  4840.9    265.0
## 11     1  1946   688.1  4900.9    402.2
## 12     1  1947   568.9  3526.5    761.5
## 13     1  1948   529.2  3254.7    922.4
## 14     1  1949   555.1  3700.2   1020.1
## 15     1  1950   642.9  3755.6   1099.0
## 16     1  1951   755.9  4833.0   1207.7
## 17     1  1952   891.2  4924.9   1430.5
## 18     1  1953  1304.4  6241.7   1777.3
## 19     1  1954  1486.7  5593.6   2226.3
## 20     2  1935   209.9  1362.4     53.8
## 21     2  1936   355.3  1807.1     50.5
## 22     2  1937   469.9  2676.3    118.1
## 23     2  1938   262.3  1801.9    260.2
## 24     2  1939   230.4  1957.3    312.7
## 25     2  1940   361.6  2202.9    254.2
## 26     2  1941   472.8  2380.5    261.4
## 27     2  1942   445.6  2168.6    298.7
## 28     2  1943   361.6  1985.1    301.8
## 29     2  1944   288.2  1813.9    279.1
## 30     2  1945   258.7  1850.2    213.8
## 31     2  1946   420.3  2067.7    132.6
## 32     2  1947   420.5  1796.7    264.8
## 33     2  1948   494.5  1625.8    306.9
## 34     2  1949   405.1  1667.0    351.1
## 35     2  1950   418.8  1677.4    357.8
## 36     2  1951   588.2  2289.5    342.1
## 37     2  1952   645.5  2159.4    444.2
## 38     2  1953   641.0  2031.3    623.6
## 39     2  1954   459.3  2115.5    669.7