Practical Econometrics & Data Science

Book I: Cross-sectional data

Author

Andrius Buteikis

Published

26 May, 2024 (First version: 22 February, 2024)

Preface

Cross-sectional data refers to data collected by observing different subjects (e.g. individuals, corporations, countries, animals, etc.). Usually this type of data spans a single point in time, but the data can also be collected during a period of time, e.g. over a specific year.

Cross-sectional data may also refer to data collected on the same subjects at different times; such data are referred to as pooled cross-sectional data. They can be used to identify so-called treatment effects, where subjects are observed before and after a treatment is administered to some of them [1].
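
The treatment-effect idea can be illustrated with a minimal difference-in-differences sketch. The data, the group and period labels, and the function names below are hypothetical and chosen only for illustration; the calculation compares the average change in the treated group with the average change in the control group, and it assumes both groups would have followed the same trend without the treatment.

```python
from statistics import mean

# Hypothetical pooled cross-section: (group, period, outcome).
# "treated" subjects receive the treatment between the two periods.
observations = [
    ("control", "before", 10.0), ("control", "before", 12.0),
    ("control", "after", 13.0), ("control", "after", 15.0),
    ("treated", "before", 11.0), ("treated", "before", 13.0),
    ("treated", "after", 18.0), ("treated", "after", 20.0),
]


def group_mean(group, period):
    """Average outcome of one group in one period."""
    return mean(y for g, p, y in observations if g == group and p == period)


def diff_in_diff():
    """Change in the treated group minus change in the control group."""
    change_treated = group_mean("treated", "after") - group_mean("treated", "before")
    change_control = group_mean("control", "after") - group_mean("control", "before")
    return change_treated - change_control


did_estimate = diff_in_diff()
print(did_estimate)  # 4.0
```

Here the control group's average outcome rises by 3 and the treated group's by 7, so the estimated treatment effect is 4. In practice the same quantity is usually estimated with a regression that includes group, period and interaction terms, which also yields standard errors.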

Cross-sectional data can be analysed using various statistical and machine learning methods, which can be classified as follows [2]:

  • Supervised learning is a collection of methods applied to data in which the dependent variable (i.e. the object of interest) is observed, and is modelled using other variables that describe the subjects’ characteristics. For example, we may have data on a person’s wage, as well as their other characteristics, such as age, years of experience, gender, family status, education and so on, with the goal of identifying which of a person’s characteristics (or features) affect their wage, and by how much. Supervised learning methods can be classified into two main categories:
    • Regression, which includes linear regression, quantile regression, principal component regression and other regression models, and is used when our variable of interest is continuous (e.g. wage in Euros, weight in tonnes, etc.);
    • Classification, which includes methods such as logistic regression, decision trees, the k-nearest neighbors algorithm and support vector machines, and is used when our variable of interest is categorical (e.g. 0 (failure) or 1 (success); 0 (bad), 1 (average), 2 (good), and so on).
  • Unsupervised learning is a collection of methods used when there is no variable of interest, but there are many different variables for each subject. We may be interested in grouping similar subjects together, identifying anomalies (e.g. for fraud detection), or reducing the number of variables. Commonly used methods include:
    • Clustering methods, such as k-means clustering, hierarchical clustering and DBSCAN.
    • Anomaly detection methods, such as isolation forest, as well as classic outlier detection methods.
    • Dimensionality reduction techniques, such as factor analysis, principal component analysis and independent component analysis.
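
The distinction between the categories above can be sketched in a few lines of plain Python. Everything in this example is an assumption made for illustration: the experience and wage figures are invented, the 1600 wage threshold is arbitrary, and the helper functions are hand-written stand-ins rather than the interface of any particular library. Regression and classification both use an observed dependent variable, while clustering uses only the features.

```python
from statistics import mean

# Hypothetical data: years of experience and monthly wage for six people.
experience = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
wage = [1100.0, 1300.0, 1500.0, 1700.0, 1900.0, 2100.0]
# Categorical version of the outcome: 1 if the wage is at least 1600, else 0.
high_wage = [1 if w >= 1600 else 0 for w in wage]


def fit_simple_ols(x, y):
    """Regression: intercept and slope of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def nearest_neighbour_label(x_train, labels, x_new):
    """Classification: label of the single closest training observation."""
    closest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return labels[closest]


def two_means(x, n_iter=10):
    """Clustering: one-dimensional k-means with k = 2, started at min and max."""
    centres = [min(x), max(x)]
    for _ in range(n_iter):
        groups = [[], []]
        for xi in x:
            nearest = 0 if abs(xi - centres[0]) <= abs(xi - centres[1]) else 1
            groups[nearest].append(xi)
        # Keep the old centre if a group happens to be empty.
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return centres


# Supervised: the dependent variable (wage or high_wage) is observed.
intercept, slope = fit_simple_ols(experience, wage)
predicted_class = nearest_neighbour_label(experience, high_wage, 5.4)

# Unsupervised: only the feature is used, with no dependent variable.
cluster_centres = two_means(experience)

print(intercept, slope)
print(predicted_class)
print(cluster_centres)
```

With these invented numbers the fitted line has a slope of about 200 (each extra year of experience is associated with 200 more in wage), a person with 5.4 years of experience is classified as high wage, and the two cluster centres fall at 2 and 5 years of experience.
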
Important

Currently, only exercises and exercise examples are added. Eventually, material relating to regression models, interpretation and model characteristics will be added.


  1. See difference-in-differences for more.

  2. The full list of methods is outside the scope of these notes, so we mention only some of the more popular ones. The notes themselves will focus on the more commonly used regression and classification methods.