Introduction

The motivation for this package is to provide functions which help with the development and tuning of machine learning models in biomedical data where the sample size is frequently limited, but the number of predictors may be significantly larger (P >> n). While most machine learning pipelines involve splitting data into training and testing cohorts, typically 2/3 and 1/3 respectively, medical datasets may be too small for this, so that estimates of accuracy on the left-out test set become unreliable simply because the test set is small. Nested cross-validation (CV) provides a way around this by maximising use of the whole dataset for estimating overall accuracy, while maintaining the separation between training and testing.

In addition, typical biomedical datasets often have tens of thousands of possible predictors, so filtering of predictors is commonly needed. However, it has been demonstrated that filtering on the whole dataset creates a bias when determining the accuracy of models (Vabalas et al., 2019). Feature selection should be considered an integral part of a model and performed only on the training data. The selected features and accompanying model can then be tested on hold-out test data without bias. Thus, it is recommended that any filtering of predictors is performed within the CV loops, to prevent leakage of information from the test data.

This package enables nested cross-validation (CV) to be performed using the commonly used glmnet package, which fits elastic net regression models, and the caret package, which is a general framework for fitting a large number of machine learning models. In addition, nestedcv adds functionality to enable cross-validation of the elastic net alpha parameter when fitting glmnet models.
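As a minimal sketch (not run), a caret model can be nested in the same way using the nestcv.train() wrapper, assuming it accepts the same filterFUN/filter_options arguments as nestcv.glmnet; the caret method string "rf" below is only an illustrative choice.

## Sketch (not run): nested CV around a caret model via nestcv.train()
fit_caret <- nestcv.train(y, x,
                          method = "rf",   # any caret method string
                          filterFUN = ttest_filter,
                          filter_options = list(nfilter = 100))
fit_caret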

nestedcv partitions the dataset into outer and inner folds (default 10 x 10). The inner fold CV (default 10-fold) is used to tune optimal hyperparameters for the model. The model is then fitted to the entire outer training fold (i.e. all of the inner folds combined) and tested on the left-out data from the outer fold. This is repeated across all outer folds (default 10), and the unseen test predictions from the outer folds are concatenated and compared against the true outcomes to give measures of accuracy (e.g. AUC and accuracy for classification, or RMSE for regression) across the whole dataset.

A final round of CV is performed on the whole dataset to determine the hyperparameters used to fit the final model to all of the data; this final model can then be used for prediction with external data.
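A minimal sketch (not run), given a response y and predictor matrix x: the number of outer and inner folds can be changed, and the final model can be applied to new samples with predict(). The argument names n_outer_folds and n_inner_folds and the matrix x_external are assumptions used here for illustration.

## Sketch (not run): changing fold numbers and predicting on external data
fit <- nestcv.glmnet(y, x, family = "binomial",
                     n_outer_folds = 5,    # outer CV folds
                     n_inner_folds = 10)   # inner CV folds used for tuning
fit                                        # prints performance over outer folds
preds <- predict(fit, newdata = x_external, type = "response")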

Variable selection

While some models such as glmnet allow for sparsity and have variable selection built-in, many models fail to fit when given massive numbers of predictors, or perform poorly due to overfitting unless variable selection is applied. In addition, in medicine one of the goals of predictive modelling is commonly the development of diagnostic or biomarker tests, for which reducing the number of predictors is typically a practical necessity.

Several filter functions (t-test, Wilcoxon test, ANOVA, Pearson/Spearman correlation, random forest variable importance, and ReliefF from the CORElearn package) for feature selection are provided, and can be embedded within the outer loop of the nested CV.
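A user-defined filter can also be supplied via the filterFUN argument. The following is a minimal sketch, assuming (as for the built-in filters) that a filter takes the response y and predictor matrix x and returns the indices of the columns to keep:

## Sketch: a simple user-defined variance filter
var_filter <- function(y, x, nfilter = 100, ...) {
  v <- apply(x, 2, var)                   # rank predictors by variance
  order(v, decreasing = TRUE)[1:nfilter]  # return indices of top predictors
}

fit3 <- nestcv.glmnet(y, x, family = "binomial",
                      filterFUN = var_filter,
                      filter_options = list(nfilter = 50))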

Installation

install.packages("nestedcv")
library(nestedcv)

Examples

Importance of nested CV

The following simulated example demonstrates the bias intrinsic to datasets where P >> n when applying filtering of predictors to the whole dataset rather than to training folds.

## Example binary classification problem with P >> n
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04)  # predictors
y <- factor(rbinom(150, 1, 0.5))  # binary response

## Partition data into 2/3 training set, 1/3 test set
trainSet <- caret::createDataPartition(y, p = 0.66, list = FALSE)

## t-test filter using the whole dataset
filt <- ttest_filter(y, x, nfilter = 100)
filx <- x[, filt]

## Train glmnet on training set only using filtered predictor matrix
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
fit <- cv.glmnet(filx[trainSet, ], y[trainSet], family = "binomial")

## Predict response on test set
predy <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "class")
predy <- as.vector(predy)
predyp <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "response")
predyp <- as.vector(predyp)
output <- data.frame(testy = y[-trainSet], predy = predy, predyp = predyp)

## Results on test set
## shows bias since univariate filtering was applied to whole dataset
predSummary(output)
##          Reference
## Predicted  0  1
##         0 18  5
##         1  8 19
## 
##               AUC            Accuracy   Balanced accuracy   
##            0.8413              0.7400              0.7420

## Nested CV
fit2 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 7:10 / 10,
                      filterFUN = ttest_filter,
                      filter_options = list(nfilter = 100))
fit2
## Nested cross-validation with glmnet
## Filter:  ttest_filter 
## 
## Final parameters:
##    lambda      alpha  
## 0.0001375  0.7000000  
## 
## Final coefficients:
## (Intercept)       V5068      V18605      V18565       V6136      V16524 
##   -0.342181    1.162420    0.971690    0.926991    0.899378   -0.808576 
##       V1096      V13253      V12828       V7739       V7285       V5695 
##    0.775346   -0.773682   -0.713136   -0.712557   -0.695180    0.672615 
##       V2259       V3602      V15251      V19122       V5214       V3183 
##   -0.659286    0.652511   -0.648440   -0.635247   -0.623731   -0.622875 
##      V11128      V10314       V9812       V5500       V3941       V2918 
##    0.595687    0.589447   -0.576614    0.575473   -0.572090   -0.571144 
##       V9478      V11774      V11456       V8418       V6495       V9119 
##    0.558277    0.534328    0.521071   -0.499918    0.497823    0.496642 
##       V9036      V14711      V18054      V16555        V107      V12593 
##    0.495973    0.486175    0.485413   -0.472040   -0.470391   -0.462250 
##       V7939      V16771       V3727      V18700       V4180      V16175 
##   -0.456095    0.453205    0.424451   -0.412546   -0.395391   -0.395102 
##      V18690      V16434       V7193      V18910      V18016       V4332 
##   -0.391721   -0.386519   -0.386153    0.383368    0.349471    0.348220 
##      V18621      V11062      V19206      V17316      V13239       V9770 
##    0.345092   -0.344524    0.341776    0.325725    0.315424   -0.311713 
##      V15062       V5224       V9348       V4610      V18314      V19568 
##   -0.306218    0.295598   -0.286564    0.281177    0.276939    0.273352 
##       V3529      V12036      V16162      V13352      V12396      V13825 
##    0.270421    0.270280   -0.257535   -0.254830   -0.252848    0.248817 
##      V17781      V17817       V1284      V16851      V12796      V15032 
##    0.221728   -0.211620   -0.193409   -0.193001   -0.179213   -0.160006 
##      V18773      V18966       V2203       V5114       V1138       V8666 
##   -0.159573    0.141131    0.135234    0.128998   -0.125391    0.118390 
##      V18796       V6840      V13026        V471       V8960         V62 
##    0.110471    0.107065   -0.084358   -0.075113   -0.069938    0.069336 
##      V10334        V383       V9823      V13812 
##    0.056143    0.050758   -0.027060   -0.005367 
## 
## Result:
##          Reference
## Predicted  0  1
##         0 39 42
##         1 38 31
## 
##               AUC            Accuracy   Balanced accuracy   
##            0.4464              0.4667              0.4656

testroc <- pROC::roc(output$testy, output$predyp, direction = "<", quiet = TRUE)
inroc <- innercv_roc(fit2)
plot(fit2$roc)
lines(inroc, col = 'blue')
lines(testroc, col = 'red')
legend('bottomright', legend = c("Nested CV", "Left-out inner CV folds", 
                                 "Test partition, non-nested filtering"), 
       col = c("black", "blue", "red"), lty = 1, lwd = 2, bty = "n")

In this example the dataset is pure noise. Filtering of predictors on the whole dataset is a source of leakage of information about the test set, leading to substantially over-optimistic performance on the test set as measured by ROC AUC, whereas the nested CV result correctly shows performance no better than chance.

Figures A & B below show two commonly used but biased schemes in which cross-validation is used to fit models, yet gives a biased estimate of model performance. In scheme A, there is no hold-out test set at all, so there are two sources of bias/data leakage: first, filtering on the whole dataset, and second, using the left-out CV folds to measure performance. Left-out CV folds are known to give biased estimates of performance because the tuning parameters are 'learnt' by optimising the result on those same left-out folds.
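As a minimal illustration (not run) using the simulated data above, scheme A amounts to filtering on the whole dataset and then quoting the cross-validated performance from the same CV used for tuning:

## Sketch of scheme A (not run): filter on the whole dataset, then report
## performance from the CV folds that were also used to tune lambda
filtA <- ttest_filter(y, x, nfilter = 100)
fitA <- cv.glmnet(x[, filtA], y, family = "binomial", type.measure = "auc")
max(fitA$cvm)  # apparent CV AUC; optimistic as there is no true hold-out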

In scheme B, CV is used to tune parameters and a hold-out set is used to measure performance, but information leakage occurs when filtering is applied to the whole dataset; this corresponds to the non-nested filtering example above. Unfortunately, this is commonly seen in studies which apply differential expression analysis to the whole dataset to select predictors that are then passed to machine learning algorithms.

[Figure: Schemes A & B. Whole dataset: filter features +/- balance samples, then 10-fold CV to tune model parameters. (A) Performance measured on the left-out CV folds; no true hold-out. (B) Tuned model tested on a hold-out partition.]