4  Quick Intro to Machine Learning

This chapter¹ is a quick overview of principles in machine learning. Skip it if you already know the difference between validation and test data.

The chapter focuses on supervised machine learning, the most common type of learning in this book. Machine learning models output predictions based on input features, which in remote sensing are usually satellite or aerial images. For example, a model might classify crop type based on an RGB satellite image. We train machine learning models using a learning algorithm and a labelled dataset, the training data.

What follows are the steps in machine learning modeling:

The prediction task

First, we have to translate our problem into a prediction task (a code sketch follows this list):

  • Pick a target \(Y\) to predict. For example, the land cover class, or a binary fire/no-fire mask.
  • Define the task. The task follows from the target: for example, it can be classification, regression, or segmentation.
  • Pick an evaluation metric. An evaluation metric measures how good a prediction is, based on the prediction and the ground truth. For classification, this might be accuracy; for segmentation, it might be intersection over union.
  • Choose features \(X\). Features are the information from which to predict the target. In remote sensing, these are typically derived from satellite and aerial data, but may be complemented by other data, like ground observations or census data.
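To make this concrete, here is a minimal sketch in Python; the crop classes, band features, and placeholder predictions are invented for illustration and stand in for a real remote sensing dataset:

```python
# Framing a prediction task: synthetic per-pixel features X, a crop-type
# target y, and accuracy as the evaluation metric. All values illustrative.
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(seed=0)

X = rng.random((500, 3))               # features: reflectance in three bands
y = rng.integers(0, 3, size=500)       # target: crop class (0, 1, or 2)

y_pred = rng.integers(0, 3, size=500)  # placeholder predictions
print(accuracy_score(y, y_pred))       # metric: fraction of correct predictions
```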

Get the data

To train a model to fulfill the task, we need data. The data in supervised machine learning are typically represented as pairs \((x^{(i)}, y^{(i)})_{i=1, \ldots, n}\), where \(x^{(i)}\) are the features and \(y^{(i)}\) is the target.

We typically divide this dataset into three subsets:

  • A training dataset, \(D_{train}\), that we use to train the model.
  • A validation dataset, \(D_{val}\), that we use to validate modeling choices, such as the model class and hyperparameters.
  • A testing dataset, \(D_{test}\), that we use to judge the performance of the model on “unseen” data.

The simplest setup would be to split the data just once into these three buckets. Especially if we have a lot of data and deep learning models that train rather slowly, doing this split only once is common. In situations where we have little data and the models train quickly, we typically split the data multiple times (as in cross-validation) and repeat training, validation, and testing with different subsets to get more stable estimates of our model’s performance.
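Here is a minimal sketch of a single split with scikit-learn; the 60/20/20 proportions and the synthetic data are illustrative choices:

```python
# A single train/validation/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.random((500, 3))
y = rng.integers(0, 3, size=500)

# First split off the test set (20% of all data) ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ... then split the rest into training and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)
```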

Train the model

Training a machine learning model means running an algorithm that takes as input the training data and outputs the model. The model itself is again an algorithm that outputs predictions based on input features.

But before we can even train a model, we have to decide:

  1. Which model class are we considering? Do we want to train a convolutional neural network? Or a decision tree?
  2. Next, which machine learning algorithm do we want to use? A machine learning algorithm takes the training data and hyperparameter choices as input and produces a model.
  3. This means we also have to decide on the hyperparameters before we can train a model. For example, what is the maximum tree depth allowed in a random forest? Or what should the learning rate of the neural network be? (These choices are sketched in code after this list.)
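As an illustrative sketch, here is what these three choices could look like with scikit-learn (the random forest and its hyperparameter values are example choices, not recommendations):

```python
# 1. Model class: a random forest.
# 2. Learning algorithm: the fitting procedure behind RandomForestClassifier.
# 3. Hyperparameters: e.g., number of trees and maximum tree depth.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
```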

Training produces a model \(\hat{f}\) by minimizing the error on the training data:

\[\epsilon_{train}(\hat{f}) := \frac{1}{|D_{train}|} \sum_{(x,y) \in D_{train}} L(\hat{f}(x), y),\]

where \(L\) is the loss function that the model optimizes. With some machine learning algorithms you can optimize custom loss functions directly (e.g., with neural networks), while other machine learning algorithms come with a built-in loss. For example, classic decision tree algorithms for classification, such as CART, greedily optimize the Gini index.
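A rough sketch of training and computing the training error \(\epsilon_{train}\), using log loss as a stand-in for \(L\) and synthetic data (all values here are illustrative):

```python
# Train the model, then average the loss over the training data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(seed=0)
X_train = rng.random((400, 3))
y_train = rng.integers(0, 3, size=400)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, y_train)  # the learning algorithm produces the model

# Training error: mean log loss over D_train.
eps_train = log_loss(y_train, model.predict_proba(X_train))
print(eps_train)
```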

Validate modeling choices

Training doesn’t guarantee the best-performing model. For example, you might not have picked the best model class: you used a random forest, but the best model would have been a convolutional neural network. And even if you picked the best model class, you might have set the hyperparameters non-optimally; maybe a different learning rate would yield a better neural network. You can’t make these choices based on the training data, since your models are already optimized on that data and you won’t get a neutral estimate, but a biased one. That’s why you need the validation dataset \(D_{val}\) for picking a model class and for hyperparameter optimization.
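A minimal sketch of hyperparameter optimization on the validation set, assuming a random forest model class and an invented candidate grid:

```python
# Train one model per candidate max_depth; keep the best by validation accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(seed=0)
X_train, y_train = rng.random((400, 3)), rng.integers(0, 3, size=400)
X_val, y_val = rng.random((100, 3)), rng.integers(0, 3, size=100)

best_model, best_score = None, -np.inf
for depth in [2, 5, 10]:  # candidate hyperparameter values
    model = RandomForestClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_model, best_score = model, score
```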

Evaluate the model on test data

To understand how well the model will perform on “unseen” data, you can’t rely on the evaluation metrics from the training and validation data. Both will be overly optimistic and therefore biased. Instead, you need to evaluate the performance on the separate test dataset \(D_{test}\). This will give you a realistic estimate of how well the model will perform.

To evaluate the model, you could use the same loss function that you used to optimize the model. But loss functions are often restricted: when training neural networks, for example, you can only use loss functions that are differentiable. And for many other algorithms, like random forests, the loss function may be fixed by the implementation, or you have only a small selection of losses to pick from. For example, for classification, you can’t directly use accuracy or the F1 score in the optimization.

But for the evaluation metric, you have a lot more freedom. It’s even best practice to use multiple metrics to evaluate the model. For example, while you might have optimized your classification model with binary cross-entropy, you could evaluate it using accuracy, F1 score, and the confusion matrix.
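A minimal sketch of such an evaluation; the model, the binary target, and the data are illustrative placeholders:

```python
# Evaluate a trained classifier on the test set with several metrics at once.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

rng = np.random.default_rng(seed=0)
X_train, y_train = rng.random((400, 3)), rng.integers(0, 2, size=400)
X_test, y_test = rng.random((100, 3)), rng.integers(0, 2, size=100)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))  # binary target here
print(confusion_matrix(y_test, y_pred))
```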

Beyond supervised

All the explanations here were about supervised learning, which is probably the most important type of learning for remote sensing. The other flavors of machine learning are:

  • In unsupervised learning you don’t have any labels. Clustering and anomaly detection are examples of unsupervised learning. Since this type of ML has no labels, there is much more pressure on picking the right optimization goal and constraints.
  • In reinforcement learning, the model is an actor in an environment and selects among different actions to receive rewards. Reinforcement learning is more typical in robotics and games.
  • Self-supervised learning is useful for pre-training (deep learning) models and for solving certain types of tasks. We will see self-supervised ML algorithms in the foundation models chapter. In a way, self-supervised learning is just supervised learning with automatically generated labels.

  1. This short primer on machine learning is adapted from one of my other books, Supervised Machine Learning for Science (Freiesleben and Molnar 2024).