Table 1

Overview of the most commonly used terminology in machine learning literature (addressed in this review)

Machine learning: The use of mathematical models for capturing structure in data. After optimisation on example data (so-called training), the models can be used to make predictions about new, unseen data.
Features: Visual properties of the data that are quantitatively summarised in an array of numbers. In conventional machine learning, these features are clinically inspired and thus handcrafted, while in deep learning these features are automatically learnt from the data.
Computer-aided detection (CADe): Machine learning algorithms applied to medical data for primary detection of pathology (eg, polyp detection).
Computer-aided diagnosis (CADx): Machine learning algorithms applied to medical data for predicting diagnoses (eg, polyp classification).
Deep learning: A form of machine learning in which a neural network with several layers is used, exploiting hierarchical relations in the data. The major difference from conventional machine learning is that the features and relations are all learnt from the data, a property also referred to as end-to-end learning.
Pretraining: Training a deep learning algorithm with data that are different from the target data. This technique can be exploited to first train a rough model on a large dataset, which can then be fine-tuned on a smaller dataset of interest. ImageNet is by far the most commonly used dataset for pretraining.
Transfer learning: Used after a deep neural network has been pretrained on a large dataset that is different from the target data, generally one of general imagery not specific to the final purpose of the algorithm. The pretrained model extracts basic discriminating features from the large dataset, and these features and their weights are then ‘transferred’ for training and fine-tuning on new data specific to the target purpose of the model. Transfer learning is often applied when insufficient target data are available to train the network from scratch.
Hyperparameters: Almost all machine learning models are regulated by so-called hyperparameters, which govern the model architecture and its training procedure. Examples of common hyperparameters in neural networks are the number of layers and the learning rate. These parameters generally cannot be optimised during the training process and are typically chosen based on a number of trials using an empirically driven approach.
Hyperparameter optimisation: The process of finding the right hyperparameters of a model, based on the performance on the validation set. This is performed either by using a grid search, in which a number of options are defined for each hyperparameter and all combinations are systematically evaluated, or by using a random search, in which the values are randomly sampled from a predefined range.
Ensemble learning: Instead of training a single model on the whole dataset, one can train multiple models, each slightly differently, that all yield a prediction about the same data point. These models are generally trained on different subsets of the data and with slightly different hyperparameters. Averaging the scores of the different models generally leads to a better and more robust prediction.
Training dataset: A set of data (examples) on which the mathematical model is optimised (trained). In supervised learning, the examples are labelled, and the model is trained to predict the labels of the samples.
Validation dataset: A separate set of data samples that can be used to tune the hyperparameters of the model. A model can be trained several times with different hyperparameter values (on the training set) and the ones that achieve the best performance on the validation set are chosen. Often referred to as ‘internal validation’.
Test dataset: A set of data samples used neither for training the model nor for optimisation of the hyperparameters. The performance on the test set reflects how well the model generalises to new, unseen data.
Cross-validation: A validation approach that is more robust to outliers than a regular hold-out approach. In K-fold cross-validation, the data used for training and validation are split into K parts, after which the model is subsequently trained with K-1 folds of data and validated on the left-out fold. This is repeated for all folds, after which the scores are pooled.
Overfitting: A phenomenon that occurs when the model is too tightly fitted to the training data and does not generalise to new data (ie, the model only works for the given training examples). Overfitting can be recognised by high training performance combined with low test performance.
Data augmentation: A way to artificially enhance the size of a dataset by adding slightly distorted copies of the original data points to the training set. The samples are distorted in such a way that the labels do not change after applying the transformation (eg, rotation, slight skewing, minor zooming, adding noise). The use of data augmentation generally leads to more robust models.
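
The grid-search procedure described under hyperparameter optimisation, together with the train/validation/test split, can be sketched in a few lines of Python. This is a minimal illustration, not code from the review: the toy data, the "shrinkage" hyperparameter and the nearest-class-mean model are all hypothetical stand-ins for a real model and its hyperparameters.

```python
import itertools
import random

random.seed(0)
# Illustrative one-dimensional toy data: one feature drawn around the class label.
data = [([random.gauss(label, 1.0)], label) for label in (0, 1) for _ in range(100)]
random.shuffle(data)
# Hold-out split into training, validation and test sets.
train, val, test = data[:120], data[120:160], data[160:]

def fit(samples, shrinkage):
    # Hypothetical "model": per-class means, scaled by a shrinkage hyperparameter.
    return {
        label: shrinkage * sum(x[0] for x, y in samples if y == label)
        / sum(1 for _, y in samples if y == label)
        for label in (0, 1)
    }

def accuracy(means, samples):
    # Classify each sample by its nearest class mean and count the hits.
    hits = sum(
        1 for x, y in samples
        if min(means, key=lambda m: abs(x[0] - means[m])) == y
    )
    return hits / len(samples)

# Grid search: define options per hyperparameter, systematically evaluate every
# combination on the validation set, and keep the best-scoring setting.
grid = {"shrinkage": [0.5, 0.8, 1.0, 1.2]}
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda params: accuracy(fit(train, **params), val),
)
print("chosen hyperparameters:", best)
print("test accuracy: %.2f" % accuracy(fit(train, **best), test))
```

Note that the test set is touched only once, after the hyperparameters have been fixed on the validation set, so the final score is an unbiased estimate of generalisation.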
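
The K-fold cross-validation scheme can likewise be sketched with the standard library alone. The `evaluate` routine below is a hypothetical placeholder (again a nearest-class-mean toy model) standing in for any real train-and-score procedure.

```python
import random

random.seed(1)
# Illustrative toy data: (feature vector, label) pairs.
data = [([random.gauss(label, 1.0)], label) for label in (0, 1) for _ in range(50)]
random.shuffle(data)

def evaluate(train_samples, val_samples):
    # Placeholder model: classify by the nearest per-class mean.
    means = {
        label: sum(x[0] for x, y in train_samples if y == label)
        / sum(1 for _, y in train_samples if y == label)
        for label in (0, 1)
    }
    hits = sum(
        1 for x, y in val_samples
        if min(means, key=lambda m: abs(x[0] - means[m])) == y
    )
    return hits / len(val_samples)

K = 5
folds = [data[i::K] for i in range(K)]  # split the data into K parts

scores = []
for k in range(K):
    held_out = folds[k]  # validate on the left-out fold
    remaining = [s for i, fold in enumerate(folds) if i != k for s in fold]
    scores.append(evaluate(remaining, held_out))  # train on the other K-1 folds

# Pool the K per-fold scores into a single estimate.
print("pooled score: %.2f" % (sum(scores) / K))
```

Because every sample is used for validation exactly once, the pooled score is less sensitive to an unlucky single split than a plain hold-out estimate.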
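
Finally, the label-preserving idea behind data augmentation can be shown with a minimal additive-noise example; the two training samples and the noise level are purely illustrative (image augmentations such as rotation or skewing follow the same pattern on pixel arrays).

```python
import random

random.seed(2)
# Two illustrative labelled samples: (feature vector, label).
train = [([1.0, 2.0], 1), ([0.1, -0.3], 0)]

def augment(samples, copies=3, noise=0.05):
    # Append `copies` jittered versions of every sample; the small additive
    # noise distorts the features but, by construction, not the label.
    out = list(samples)
    for x, y in samples:
        for _ in range(copies):
            out.append(([v + random.gauss(0.0, noise) for v in x], y))
    return out

augmented = augment(train)
print(len(train), "->", len(augmented))  # 2 -> 8 training samples
```

The enlarged set exposes the model to plausible variations of each example, which generally yields more robust training.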