1 Introduction

Data can be collected on various economic and individual levels. Some examples of collected data include:

  • Microeconomic data:
    • Company/Business/Industry data - sales, expenses, supply, etc.;
    • Individual-specific (e.g. household) data - income, employment, education, family members, age, gender, etc.;
  • Macroeconomic data:
    • Population census data - unemployment rate, income percentiles, etc.;
    • Inflation rate;
    • Housing market data (home ownership, rent percentage, etc.);
    • Supply of oil, wood, water, electricity and metal, etc.;
  • Financial data - e.g. the trading volume and value of specific stocks or commodities (gold and other metals, food, energy, etc.);

With the rise of social media and of mobile and web applications, it has become increasingly easy to collect data about various events, including:

  • Website traffic data;
  • Social media and news outlet posts/articles/images/podcasts/videos about events, people or products;

Given this vast amount of varied data and observations, there is a natural need to systematize and analyze the data in order to gain insight into the factors which could have effects on an individual, company or even country level. Accordingly, we can distinguish two types of methodologies:

  1. Data Analysis. Data analysis uses statistical and econometric methods to analyse data. These methods produce data-driven models which help us understand and explain the links between various social, economic and financial effects. The models also support decision making, since the effect of a decision can be evaluated and quantified with the fitted model. Another upside is that the models are usually easy to interpret, so specific effects can be distinguished. Econometrics is the application of mathematical and statistical methods to economic data. It is a branch of economics which uses empirical data to analyse the validity of economic relations. In this context, we can often refer to data analysis as econometrics without loss of generality.

  2. Data Science. Unlike data analysis, data science focuses on model complexity, using statistical and machine learning algorithms on vast amounts of varied (not necessarily financial or economic) data. Because of the complexity of these methods and the high volume of data available, the resulting models do not always have a clear interpretation for individual factors, in contrast to data analysis models. While this makes model evaluation more challenging, such models often provide accurate predictions and are frequently used when working with large and complex data sets.

In other words, data analysis focuses on finding and interpreting the causality between various effects, while data science focuses on predicting the possible outcomes using the available data.

Another important distinction is the language and terms used to describe certain characteristics or methods. This is mostly due to Data Science being closely linked to Computer Science. Below we provide a couple of examples:

Table 1.1: A table generated by the longtable package.
Statistics | Data Science | Meaning
--- | --- | ---
estimation | learning | Use data to estimate an unknown parameter (mean, variance, model coefficients, etc.)
classification | supervised learning | Predict a discrete value of \(Y\) using values of other variables \(X_1\), \(X_2\), …, \(X_K\)
regression | supervised learning | Predict a continuous value of \(Y\) using values of other variables \(X_1\), \(X_2\), …, \(X_K\)
clustering | unsupervised learning | Group the data based on some variables \(X_1\), \(X_2\), …, \(X_K\)
data | training sample | \((Y_1, X_{1,1}, X_{2,1}, ..., X_{K,1})\), …, \((Y_N, X_{1,N}, X_{2,N}, ..., X_{K,N})\)
covariates | features | Variables \(X_{1,i}\), \(X_{2,i}\), …, \(X_{K,i}\), which are collected for each observation \(i = 1,...,N\)
response | label | The variable of interest \(Y\)
classifier | hypothesis | A map \(f\) of covariates to outcomes which describes their relationship, i.e. \(f: X \rightarrow Y\)
hypothesis | - | A subset of a parameter space
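
To make the two vocabularies more concrete, the following minimal Python sketch fits a linear regression with a single covariate by ordinary least squares. The data values and the names `fit_simple_ols` and `f` are made up for illustration and are not taken from this book; the comments note which statistics term and which data science term applies at each step.

```python
# Hypothetical toy data (illustrative values only, constructed so that y = 1 + 2x).
# Statistics: "data"; data science: "training sample".
x = [1.0, 2.0, 3.0, 4.0, 5.0]    # covariate (statistics) / feature (data science)
y = [3.0, 5.0, 7.0, 9.0, 11.0]   # response (statistics) / label (data science)


def fit_simple_ols(x, y):
    """Estimate (statistics) or learn (data science) the intercept and slope
    of y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of cross-products and squares around the sample means.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_simple_ols(x, y)


def f(x_new):
    """The fitted map f: X -> Y. Because Y is continuous, this is a
    regression task, i.e. supervised learning in data science terms."""
    return intercept + slope * x_new


print("estimated intercept and slope:", intercept, slope)
print("prediction at x = 6:", f(6.0))
```

The estimated slope is the kind of quantity a data analyst would interpret (the change in \(Y\) associated with a unit change in \(X_1\)), while the value returned by `f` is the kind of quantity a data scientist would evaluate for predictive accuracy.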

Having said that, there are methods which are applicable to both data analysis and data science, and in some cases the line between a data analyst and a data scientist may become blurry. As such, this book provides a practical overview of various methods and applications for dealing with economic data, with select chapters dedicated to introductory data science methods. The goal is to provide a broad toolbox of methods for various data types. These methods can then be combined in various ways when working on practical applications.

For a discussion of the software used in this book, please refer to Chapter ??. The focus is not on the documentation of the functions themselves, as they may become obsolete in the future, but rather on the methodology and implementation. As such, many of the implementations favour readability over optimization: some functions may run slower, but they can be read and re-implemented, either in a different programming language or with a focus on calculation speed.
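
As a small illustration of this trade-off (a sketch with hypothetical names and made-up numbers, not code from this book), the sample variance \(s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2\) can be written as explicit loops that mirror the formula, or delegated to a library routine. The example assumes Python and its standard `statistics` module.

```python
import statistics


def sample_variance_readable(values):
    """Sample variance written step by step, mirroring the formula.
    Easy to read and to port to another language, but not optimized."""
    n = len(values)

    # Step 1: the sample mean.
    total = 0.0
    for v in values:
        total += v
    mean = total / n

    # Step 2: the sum of squared deviations from the mean.
    sum_sq_dev = 0.0
    for v in values:
        sum_sq_dev += (v - mean) ** 2

    # Step 3: divide by N - 1 (the unbiased sample variance).
    return sum_sq_dev / (n - 1)


# Hypothetical values chosen only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

readable = sample_variance_readable(values)
library = statistics.variance(values)  # standard-library equivalent

print("readable version:", readable)
print("library version: ", library)
```

Both versions should agree up to floating-point rounding. The explicit version makes every step of the calculation visible, which is the property the implementations here aim for.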

The sections of this book are, for the most part, ordered by their complexity, i.e. methods used for cross-sectional data are also used and expanded on for time series data, and these are in turn expanded upon for panel data. Throughout these chapters some additional data-driven (i.e. data-science-oriented) methods will also be presented, some of them in separate chapters.