1.4 Model Fitting and Model Learning

This section is a draft and, in its current state, is best viewed as a collection of links and a very brief overview.

In standard econometrics we fit some kind of model to the available data. Model fitting is usually carried out by estimating the unknown model parameters.

In machine learning, model “fitting” is usually referred to as model building and “parameter estimation” is usually referred to as model learning. Depending on the type of model used and the task at hand, model learning can be divided into a number of different categories.

1.4.1 Supervised Learning

Supervised learning is concerned with labeled data - that is, (historical) data with some dependent variable of interest (usually denoted \(Y\)), whose values we observe and which we want to model. In addition, we usually have a (hopefully sufficiently large) number of \(K\) additional variables (usually denoted \(X_i\), \(i = 1,...,K\)), which describe some features (a term often used in machine learning) of the data. In econometrics these are usually known as explanatory variables.

There are two main areas where supervised learning is useful:

  • Classification, where the aim is to predict a discrete categorical value by using the historical data to identify the class or group an observation belongs to. The number of groups is known beforehand from the historical data. Examples include classifying whether someone is likely to default on their loan, buy a product, quit their job, choose a competitor’s product, etc. Another example is image classification (e.g. what kind of animal is displayed in a picture).
  • Regression, where the aim is to predict some continuous value. Examples include modelling of wages, product prices, the number of customers (also known as count data), etc.

Regression and classification are the two most frequently encountered tasks in econometric analysis.
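To make the two tasks concrete, below is a minimal supervised-learning sketch in Python using scikit-learn. The synthetic data and the particular model choices (a linear regression for the continuous outcome and a logistic regression for the categorical one) are illustrative assumptions, not prescriptions.

```python
# A minimal supervised-learning sketch: regression and classification.
# The data is synthetic; in practice X holds the explanatory variables
# (features) and y the observed dependent variable.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))  # two explanatory variables X_1, X_2

# Regression: a continuous outcome (e.g. wages)
y_cont = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_cont, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("R^2 on held-out data:", reg.score(X_te, y_te))

# Classification: a discrete outcome (e.g. default / no default)
y_cat = (y_cont > 0).astype(int)
clf = LogisticRegression().fit(X, y_cat)
print("Predicted classes for the first 5 observations:", clf.predict(X[:5]))
```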

1.4.2 Unsupervised Learning

Unfortunately, real-world data is often incomplete. This means that we may not know the true number of categories in the data. In such cases we employ unsupervised learning - we use only the available input variables \(X_i\), \(i=1,...,K\). The goal of unsupervised learning is to model the underlying structure of the data. However, unlike in supervised learning, there is no “correct” answer, since there is no corresponding output variable \(Y\).

Unsupervised learning models can be categorized into the following:

  • Clustering - groups the available data into separate clusters/categories/groups based on the available input variables. Unlike in classification, the number of categories is unknown and there are no variables indicating which group each historical observation belongs to. A minimal sketch follows below.
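Below is a minimal clustering sketch in Python using scikit-learn’s KMeans; the synthetic “blobs” and the choice of three clusters are assumptions made for illustration (in practice the number of clusters itself has to be chosen, e.g. via the elbow method or silhouette scores).

```python
# A minimal clustering sketch: group unlabeled data into k clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three synthetic groups of observations, with no group labels attached
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments of the first 5 points:", kmeans.labels_[:5])
print("Cluster centers:\n", kmeans.cluster_centers_)
```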

  • Anomaly detection - also known as outlier detection - is concerned with detecting suspicious observations which differ significantly from the rest of the data. By definition, outliers are unusual and unexpected values, which may arise for various reasons: genuine anomalies (e.g. a single premium product among many average products with no way of labeling it as such) or measurement errors.

Outliers are frequently encountered in empirical data and can cause serious problems in statistical analysis. In some cases outlier data can be discarded; other times it may be useful to treat such data as separate cases or (in the case of measurement error) replace these values with ones from similar observations (e.g. by finding data with similar explanatory variables \(X_i\)). A sketch of one detection method is given below.
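As a sketch of one possible approach, the snippet below flags outliers with scikit-learn’s IsolationForest; the synthetic data and the assumed contamination level (the expected share of outliers) are illustrative only.

```python
# A minimal outlier-detection sketch with an isolation forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                      # "typical" observations
X_out = rng.uniform(low=6, high=8, size=(5, 2))    # a few anomalous points
X_all = np.vstack([X, X_out])

iso = IsolationForest(contamination=0.03, random_state=0).fit(X_all)
flags = iso.predict(X_all)                         # +1 = inlier, -1 = outlier
print("Indices flagged as outliers:", np.where(flags == -1)[0])
```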

  • Association - uses rule-based machine learning methods to identify relations in the data, where certain features of a data sample correlate with other features. It is most commonly applied in market basket analysis in order to understand the purchase behavior of customers. The fitted model can then be used to up-sell, or recommend, additional products which may be complementary to a customer’s purchase. For example, if someone is ordering food online and their shopping basket contains \(\{ milk,\ carrots,\ onions\}\), then the model can be used to recommend the complementary product(s) which were most frequently purchased in similar baskets, e.g.:
| ID | Item basket |
|----|-------------|
| 1  | \(\{Bread, Soda, \color{purple}{Potatoes}\}\) |
| 2  | \(\{Wine, Bread\}\) |
| 3  | \(\{Wine, Bread, \color{red}{Potatoes}, \color{red}{Milk} \}\) |
| 4  | \(\{Soda, \color{red}{Potatoes}, \color{red}{Milk}\}\) |
\(\color{teal}{Frequently\ bought\ items}\): \(\{ Milk, Potatoes \}\)
\(\color{teal}{Association\ rule}\): \(\{Milk\} \rightarrow \{Potatoes\}\) (if milk is bought, then potatoes are frequently bought as well).

See an example in this article on DataCamp, as well as on Wikipedia.
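The rule in the example above can be verified directly. The sketch below computes the support and confidence of \(\{Milk\} \rightarrow \{Potatoes\}\) over the four baskets by plain counting; dedicated libraries (e.g. Apriori implementations) automate this search over all candidate rules.

```python
# A minimal sketch: support and confidence of {Milk} -> {Potatoes}
# over the example baskets from the table above.
baskets = [
    {"Bread", "Soda", "Potatoes"},
    {"Wine", "Bread"},
    {"Wine", "Bread", "Potatoes", "Milk"},
    {"Soda", "Potatoes", "Milk"},
]

def support(itemset):
    """Share of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Confidence: P(Potatoes in basket | Milk in basket)
conf = support({"Milk", "Potatoes"}) / support({"Milk"})
print("support({Milk, Potatoes}) =", support({"Milk", "Potatoes"}))  # 0.5
print("confidence({Milk} -> {Potatoes}) =", conf)                    # 1.0
```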

  • Autoencoders - take input data, compress it into a code, then try to recreate the input data from that summarized code. The aim is to learn a representation (encoding) for a set of data by training the model to ignore signal “noise”. For example, by training a model on both noisy and clean versions of an image, an autoencoder can then be used in image reconstruction: removing the visual noise from images and improving picture quality.

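A minimal sketch of such a denoising autoencoder, written in Keras, is shown below; the layer sizes, noise level and the random stand-in “images” are all illustrative assumptions.

```python
# A minimal denoising-autoencoder sketch: compress noisy input into a
# small code, then reconstruct the clean version from that code.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x_clean = np.random.rand(1000, 64).astype("float32")  # stand-in for images
x_noisy = np.clip(x_clean + 0.2 * np.random.randn(1000, 64), 0.0, 1.0).astype("float32")

inputs = keras.Input(shape=(64,))
code = layers.Dense(8, activation="relu")(inputs)        # compressed code
outputs = layers.Dense(64, activation="sigmoid")(code)   # reconstruction
autoencoder = keras.Model(inputs, outputs)

autoencoder.compile(optimizer="adam", loss="mse")
# Train to map the noisy version back to the clean one
autoencoder.fit(x_noisy, x_clean, epochs=5, batch_size=32, verbose=0)
x_denoised = autoencoder.predict(x_noisy[:5])
```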

Note that, unlike in supervised learning, not having a dependent variable makes it difficult to measure the accuracy of an unsupervised model. On the other hand, when the data is unlabeled, the only alternative is to manually review each data point and decide on its label (if we are working with categories), or to try to remove the noise by hand (e.g. using some existing rules, or industry experience). Doing this manually is not only time-consuming but difficult as well. In such cases unsupervised learning may prove to be faster and more consistent than a manual implementation.

1.4.3 Semi-supervised Learning

Semi-supervised learning is a middle ground between supervised and unsupervised learning - it uses both labeled and unlabeled data for model training. Usually a small sample of labeled data (labeling may be time-consuming or require hiring additional experts, and is thus not feasible for the whole dataset) is combined with a large sample of unlabeled data (which may be readily available as raw data).

The example above is taken from Wikipedia: the dashed line shows the decision boundary between the two categories which we may adopt if our data sample has two labeled points (one white and one black circle) as well as a collection of unlabeled data (gray circles).
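The situation in that example - a couple of labeled points plus many unlabeled ones - can be sketched with scikit-learn’s LabelSpreading, where unlabeled observations are marked with -1. The synthetic data below is an assumption for illustration.

```python
# A minimal semi-supervised sketch: propagate two known labels
# through a cloud of unlabeled points.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),   # first group
               rng.normal(3, 0.5, size=(50, 2))])  # second group
y = np.full(100, -1)         # -1 = unlabeled (the "gray circles")
y[0], y[50] = 0, 1           # a single labeled point per class

model = LabelSpreading(kernel="rbf").fit(X, y)
print("Inferred labels for some unlabeled points:", model.transduction_[1:6])
```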

1.4.4 Reinforcement Learning

Reinforcement learning aims to maximize some notion of cumulative reward. It is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning. The overall aim is to learn which action to take next in order to maximize the final reward.

Reinforcement learning operates on the same principle as video games - complete a level and increase your high score; lose all your lives and get a game over. As such, video games are a common test environment for these kinds of models.
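The core mechanic can be sketched without any game engine. Below is a minimal tabular Q-learning example on a toy five-state “corridor”, where the agent starts on the left and is rewarded only for reaching the rightmost state; all parameter values (alpha, gamma, epsilon) are illustrative assumptions.

```python
# A minimal tabular Q-learning sketch on a 5-state corridor.
# Actions: 0 = step left, 1 = step right; reward 1 for reaching state 4.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))      # estimated action values
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(300):
    s = 0
    while s != n_states - 1:
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))          # explore
        else:
            best = np.flatnonzero(Q[s] == Q[s].max())
            a = int(rng.choice(best))                 # exploit, break ties at random
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: nudge Q(s, a) toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("Learned policy per state (0 = left, 1 = right):", Q.argmax(axis=1))
```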

See Playing Atari with Deep Reinforcement Learning and Retro Games in Gym environments for reinforcement learning (also here).

Another example where reinforcement learning could be very useful is stock trading. For some preliminary results of such applications, see Practical Deep Reinforcement Learning Approach for Stock Trading and Adversarial Deep Reinforcement Learning in Portfolio Management.

1.4.5 Deep Learning

Deep learning is also known as hierarchical learning. It is part of a broader family of machine learning methods based on artificial neural networks that use multiple layers to progressively extract higher-level features from raw input data.

\(\color{red}{\text{Without going into too much detail (chapters on these topics may be added in the future)}}\), this family includes:

  • Deep neural networks (DNN) - an artificial neural network (ANN) with multiple layers between the input and output layers, which can model complex non-linear relationships. An artificial neural network is a network of simple elements called artificial neurons, which receive input, change their internal state (activation) according to that input, and produce output depending on the input and activation (see the sketch after this list).
  • Deep belief networks (DBN) - a composition of simple, unsupervised networks such as autoencoders, where each sub-network’s hidden layer serves as the visible layer for the next. This composition leads to a fast, layer-by-layer unsupervised training procedure. The observation that DBNs can be trained greedily, one layer at a time, led to one of the first effective deep learning algorithms.
  • Recurrent neural networks (RNN) - a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. Some applications include handwriting recognition and speech recognition.
  • Convolutional neural networks (CNN) - regularized versions of multilayer perceptrons. Multilayer perceptrons usually refer to fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The “fully-connectedness” of these networks makes them prone to overfitting data. CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble more complex patterns using smaller and simpler patterns. Therefore, on the scale of connectedness and complexity, CNNs are on the lower extreme. Most commonly applied to analyzing visual imagery.
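To illustrate the artificial-neuron description above, here is a minimal forward pass through a small dense network in plain NumPy; the weights are random stand-ins for values that training would normally learn.

```python
# A minimal sketch of a forward pass: input -> hidden activations -> output.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Activation function: each neuron outputs max(0, weighted input)."""
    return np.maximum(0.0, z)

x = rng.normal(size=3)                            # input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden layer (4 neurons)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)     # output layer (1 neuron)

h = relu(W1 @ x + b1)      # each hidden neuron's internal state (activation)
y_hat = W2 @ h + b2        # network output depends on input and activations
print("hidden activations:", h)
print("output:", y_hat)
```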



1.4.6 Zero-shot Learning

Zero-shot learning extends supervised learning to the case where training data is unavailable for some classes. The algorithms use some connection between the available information and the unseen classes. See also Feature (or Representation) Learning.

For example, if you have read a very detailed description of a zebra, you might be able to tell what a zebra is the first time you see one in a photograph. This can be framed as a natural language processing task which uses word embeddings - where words or phrases from the vocabulary are mapped to vectors of real numbers. Words with similar meanings will have similar word embeddings. Continuing the example: if the training data does not contain a zebra image, but has various images of striped animals (tigers), horse-like animals (horses, donkeys, ponies) and black/white animals (pandas, penguins, etc.), then their features (“striped”, “horse-like”, “black/white”) can be extracted and word embeddings can be generated. We could then describe what a zebra looks like and generate an appropriate dictionary word embedding using the available features. Finally, we could take an image of a zebra, extract the features of this new image, create a word embedding and compare it with the closest word embedding from our dictionary, which will likely be that of a zebra.
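The zebra example can be sketched numerically. Below, each class is described by a hand-crafted attribute vector (“striped”, “horse-like”, “black/white”) standing in for learned word embeddings, and a new image is matched to the closest class description by cosine similarity; all numbers are invented for illustration.

```python
# A minimal zero-shot sketch: match image features to class descriptions.
import numpy as np

# attribute order: [striped, horse-like, black/white]
class_embeddings = {
    "tiger": np.array([1.0, 0.0, 0.0]),
    "horse": np.array([0.0, 1.0, 0.0]),
    "panda": np.array([0.0, 0.0, 1.0]),
    "zebra": np.array([1.0, 1.0, 1.0]),  # described in words, never seen in training
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Suppose a feature extractor maps a new, unseen image to these attributes:
image_features = np.array([0.9, 0.8, 0.95])

best = max(class_embeddings, key=lambda c: cosine(image_features, class_embeddings[c]))
print("Closest class description:", best)  # expected: zebra
```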


This technique, known as zero-shot learning (ZSL), is still in its infancy and is currently an active research topic.
