---
title: "Learning algorithms and hyperparameter optimisation"
author: "Alex Zwanenburg"
date: "2024-09-20"
output:
  rmarkdown::html_vignette:
    includes:
      in_header: familiar_logo.html
      after_body: license.html
    toc: TRUE
  rmarkdown::github_document:
    html_preview: FALSE
    includes:
      in_header: familiar_logo.html
      after_body: license.html
    toc: TRUE
bibliography: "refs.bib"
vignette: >
  %\VignetteIndexEntry{Learning algorithms and hyperparameter optimisation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


Learning algorithms create models that relate input data to the outcome.
Development data is used to train models, after which they can be used to
predict outcomes for new data. Familiar implements several commonly used
algorithms. This vignette first describes the learners and their
hyperparameters. Then, hyperparameter optimisation is described in more detail.

|**learner** | **tag** | **binomial** | **multinomial** | **continuous** | **count** | **survival** |
|:------------|:------------|:-----:|:-----:|:-----:|:-----:|:-----:|
|**generalised linear models**|
| general^a^ | `glm`                        	    | × | × | × | × | × |
| logistic | `glm_logistic`              		      | × |   |   |   |   |
| cauchy | `glm_cauchy` 						              | × |   |   |   |   |
| complementary log-log | `glm_loglog` 			      | × | 	| 	| 	| 	|
| normal | `glm_probit` 						              | × | 	| 	| 	| 	|
| multinomial | `glm_multinomial` 				        | 	| × | 	| 	| 	|
| log-normal | `glm_log`, `glm_log_gaussian` 	    | 	| 	| × | 	| 	|
| normal (gaussian) | `glm_gaussian` 			        | 	| 	| × | 	| 	|
| inverse gaussian | `glm_inv_gaussian` 		      | 	| 	| × | 	| 	|
| log-poisson | `glm_log_poisson` 				        | 	| 	| x	| × | 	|
| poisson | `glm_poisson` 						            | 	| 	| x	| × | 	|
| cox | `cox` 									                  | 	| 	| 	|	  | × |
| exponential | `survival_regr_exponential` 	    | 	| 	| 	|	  | × |
| gaussian | `survival_regr_gaussian` 			      | 	| 	| 	| 	| × |
| logistic | `survival_regr_logistic` 			      | 	| 	| 	| 	| × |
| log-logistic | `survival_regr_loglogistic` 	    | 	| 	| 	| 	| × |
| log-normal | `survival_regr_lognormal` 		      | 	| 	| 	| 	| × |
| survival regression^a^ | `survival_regr` 		    | 	| 	| 	| 	| × |
| weibull | `survival_regr_weibull` 			        | 	| 	| 	| 	| × |
|**lasso regression models**|
| general^a^ | `lasso` 							              | × | × | × | × | × |
| logistic | `lasso_binomial` 					          | × | 	| 	| 	| 	|
| multi-logistic | `lasso_multinomial` 			      | 	| × | 	| 	| 	|
| normal (gaussian) | `lasso_gaussian` 			      | 	| 	| × | 	| 	|
| poisson | `lasso_poisson`						            | 	| 	| 	| × |	  |
| cox | `lasso_cox` 							                | 	| 	| 	| 	| × |
|**ridge regression models**|
| general^a^ | `ridge`                            | × | × | × | × | × |
| logistic | `ridge_binomial`                     | × |   |   |   |   |
| multi-logistic | `ridge_multinomial`            |   | × |   |   |   |
| normal (gaussian) | `ridge_gaussian`            |   |   | × |   |   |
| poisson | `ridge_poisson`                       |   |   |   | × |   |
| cox | `ridge_cox`                               |   |   |   |   | × |
|**elastic net regression models**|
| general^a^ | `elastic_net`                      | × | × | × | × | × |
| logistic | `elastic_net_binomial`               | × |   |   |   |   |
| multi-logistic | `elastic_net_multinomial`      |   | × |   |   |   |
| normal (gaussian) | `elastic_net_gaussian`      |   |   | × |   |   |
| poisson | `elastic_net_poisson`                 |   |   |   | × |   |
| cox | `elastic_net_cox`                         |   |   |   |   | × |
|**boosted linear models**|
| general^a^ | `boosted_glm`                      | × |   | × | × | × |
| area under the curve | `boosted_glm_auc`        | × |   |   |   |   | 
| cauchy | `boosted_glm_cauchy`                   | × |   |   |   |   | 
| complementary log-log | `boosted_glm_loglog`    | × |   |   |   |   | 
| logistic | `boosted_glm_logistic`               | × |   |   |   |   | 
| log-logistic | `boosted_glm_log`                | × |   |   |   |   | 
| normal | `boosted_glm_probit`                   | × |   |   |   |   | 
| gaussian | `boosted_glm_gaussian`               |   |   | × |   |   | 
| huber loss | `boosted_glm_huber`                |   |   | × |   |   | 
| laplace | `boosted_glm_laplace`                 |   |   | × |   |   | 
| poisson | `boosted_glm_poisson`                 |   |   |   | × |   | 
| concordance index | `boosted_glm_cindex`        |   |   |   |   | × | 
| cox | `boosted_glm_cox`                         |   |   |   |   | × | 
| log-logistic | `boosted_glm_loglog`             |   |   |   |   | × | 
| log-normal | `boosted_glm_lognormal`            |   |   |   |   | × | 
| rank-based estimation | `boosted_glm_gehan`     |   |   |   |   | × | 
| survival regression^a^ | `boosted_glm_surv`     |   |   |   |   | × | 
| weibull | `boosted_glm_weibull`                 |   |   |   |   | × | 
|**extreme gradient boosted linear models**|
| general^a^ | `xgboost_lm`                       | × | × | × | × | × |
| logistic | `xgboost_lm_logistic`                | × | × | × |   |   |
| gamma | `xgboost_lm_gamma`                      |   |   | × |   |   | 
| gaussian | `xgboost_lm_gaussian`                |   |   | × | x |   |
| poisson | `xgboost_lm_poisson`                  |   |   |   | × |   |
| cox | `xgboost_lm_cox`                          |   |   |   |   | × |
|**random forests**|
| random forest (RFSRC) | `random_forest_rfsrc`   | × | × | × | × | × |
| random forest (ranger) | `random_forest_ranger` | × | × | × | × | × | 
|**boosted regression trees**|
| general^a^ | `boosted_tree`                     | × |   | × | × | × |
| area under the curve | `boosted_tree_auc`       | × |   |   |   |   | 
| cauchy | `boosted_tree_cauchy`                  | × |   |   |   |   | 
| complementary log-log | `boosted_tree_loglog`   | × |   |   |   |   | 
| logistic | `boosted_tree_logistic`              | × |   |   |   |   | 
| log-logistic | `boosted_tree_log`               | × |   |   |   |   | 
| normal | `boosted_tree_probit`                  | × |   |   |   |   | 
| gaussian | `boosted_tree_gaussian`              |   |   | × |   |   | 
| huber loss | `boosted_tree_huber`               |   |   | × |   |   | 
| laplace | `boosted_tree_laplace`                |   |   | × |   |   | 
| poisson | `boosted_tree_poisson`                |   |   |   | × |   | 
| concordance index | `boosted_tree_cindex`       |   |   |   |   | × | 
| cox | `boosted_tree_cox`                        |   |   |   |   | × | 
| log-logistic | `boosted_tree_loglog`            |   |   |   |   | × | 
| log-normal | `boosted_tree_lognormal`           |   |   |   |   | × | 
| rank-based estimation | `boosted_tree_gehan`    |   |   |   |   | × | 
| survival regression^a^ | `boosted_tree_surv`    |   |   |   |   | × | 
| weibull | `boosted_tree_weibull`                |   |   |   |   | × | 
|**extreme gradient boosted trees**|
| general^a^ | `xgboost_tree`                     | × | × | × | × | × |
| logistic | `xgboost_tree_logistic`              | × | × | × |   |   |
| gamma | `xgboost_tree_gamma`                    |   |   | × |   |   | 
| gaussian | `xgboost_tree_gaussian`              |   |   | × | x |   |
| poisson | `xgboost_tree_poisson`                |   |   |   | × |   |
| cox | `xgboost_tree_cox`                        |   |   |   |   | × |
|**extreme gradient boosted DART trees**|
| general^a^ | `xgboost_dart`                     | × | × | × | × | × |
| logistic | `xgboost_dart_logistic`              | × | × | × |   |   |
| gamma | `xgboost_dart_gamma`                    |   |   | × |   |   | 
| gaussian | `xgboost_dart_gaussian`              |   |   | × | x |   |
| poisson | `xgboost_dart_poisson`                |   |   |   | × |   |
| cox | `xgboost_dart_cox`                        |   |   |   |   | × |
|**bayesian models**|
| naive bayes | `naive_bayes`                     | × | × |   |   |   |
|**nearest neighbour models**|
| k-nearest neighbours^b^ | `k_nearest_neighbours_*`, `knn_*`           | × | × | x | x |   |
|**support vector machines**|
| $C$-classification^c^ | `svm_c_*`                                     | × | × |   |   |   |
| $\nu$-classification/ regression^c^ | `svm_nu_*`                      | × | × | × | × |   |
| $\epsilon$-regression^c^ | `svm_eps_*`                                |   |   | × | × |   |
| linear kernel^d^ | `svm_*_linear`               | × | × | × | × |   |
| polynomial kernel^d^ | `svm_*_polynomial`       | × | × | × | × |   |
| radial kernel^d^ | `svm_*_radial`               | × | × | × | × |   |
| sigmoid kernel^d^ | `svm_*_sigmoid`             | × | × | × | × |   |

Table: Overview of learners implemented in familiar. ^a^ These learners test
multiple distributions or linking families and attempt to find the best option.
^b^ k-nearest neighbours learners allow for setting the distance metric using
`*`. If omitted (e.g. `knn`), the kernel is either `euclidean` (numeric
features) or `gower` (mixed features). ^c^ The SVM kernel is indicated by `*`.
If omitted (e.g. `svm_c`), the radial basis function is used as kernel. ^d^ SVM
type is indicated by `*`, and must be one of `c`, `nu`, or `epsilon`.

## Configuration options

Learners, their hyperparameters, and parallelisation options can be specified
using the tags or arguments below:

| **tag** / **argument** | **description** | **default** |
|:----------:|:-------------------------|:-----------:|
| `learner` | The desired learner. Multiple learners may be provided at the same time. This setting has no default and must be provided. | -- (required) |
| `hyperparameter` | Each learner has one or more hyperparameters which can be specified. Sequential model-based optimisation is used to identify hyperparameter values from the data unless these are specifically specified here. See the section on hyperparameter optimisation for more details. | -- (optional) |
| `parallel_model_development` | Enables parallel processing for model development. Ignored if `parallel=FALSE`. | `TRUE` |

Table: Configuration options for model development.

## Generalised linear models

Generalised linear models (GLM) are easy to understand, use and share. In many
situations, GLM may be preferred over more complex models as they are easier to
report and validate. The most basic GLM is the linear model. The linear model
for an outcome variable $y_j$ and a single predictor variable $x_j$ for sample
$j$ is:

$$y_j=\beta_0 + \beta_1 x_j + \epsilon_j$$
Here, $\beta_0$ and $\beta_1$ are both model coefficients. $\beta_0$ is also
called the model intercept. $\epsilon_j$ is an error term that describes the
difference between the predicted and the actual response for sample $j$. When a
linear model is developed, the $\beta$-coefficients are set in such manner that
the mean-squared error over the sample population is minimised.

The above model is a univariate model as it includes only a single predictor. A
multivariate linear model includes multiple predictors $\mathbf{X_j}$ and is
denoted as:

$$y_j= \mathbf{\beta}\mathbf{X_j} +  \epsilon_j$$

$\mathbf{\beta}\mathbf{X_j}$ is the linear predictor. The GLM generalises this
linear predictor through a linking function $g$ [@Nelder1972-rs]:

$$y_j=g\left(\mathbf{\beta}\mathbf{X}\right) + \epsilon_j$$

For example, binomial outcomes are commonly modelled using logistic regression
with the `logit` linking function. Multiple linking functions are available in
familiar and are detailed below.

### Linear models for binomial outcomes

Linear models for binomial outcomes derive from the `stats` package which is
part of the R core distribution [@rcore2018]. Hyperparameters for these models
include linkage functions and are shown in the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size |`sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| linking function |`family` | `logistic`, `probit`, `loglog`, `cauchy` | `glm` only | The linking function is not optimised when it is specified, e.g. `glm_logistic`.|
| sample weighting | `sample_weighting` | `inverse_number_of_samples`, `effective_number_of_samples`, `none` | no | Sample weighting allows for mitigating class imbalances. The default is `inverse_number_of_samples`. Instances with the majority class receive less weight. |
| beta | `sample_weighting_beta` | $\mathbb{Z} \in \left[-6,-1\right]$ | `effective_number_of_samples` only | Specifies the $\beta$ parameter for effective number of samples weighting [@Cui2019-ys]. It is expressed on the $\log_{10}$ scale: $\beta=1-10^\texttt{beta}$. |

### Linear models for multinomial outcomes

Prior to version 1.3.0 the linear model for multinomial outcomes was implemented
using the `VGAM::vglm` function [@Yee1996-ql; @Yee2010-rg; @Yee2015-bw]. For
better scalability, this has been replaced by the `nnet::multinom` function
[@Venables2002-ma]. Hyperparameters for this model are shown in the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size |`sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| linking function |`family` | `multinomial` | no | There is only one linking function available.|
| sample weighting | `sample_weighting` | `inverse_number_of_samples`, `effective_number_of_samples`, `none` | no | Sample weighting allows for mitigating class imbalances. The default is `inverse_number_of_samples`. Instances with the majority class receive less weight. |
| beta | `sample_weighting_beta` | $\mathbb{Z} \in \left[-6,-1\right]$ | `effective_number_of_samples` only | Specifies the $\beta$ parameter for effective number of samples weighting [@Cui2019-ys]. It is expressed on the $\log_{10}$ scale: $\beta=1-10^\texttt{beta}$. |

### Linear models for continuous and count-type outcomes

Linear models for continuous and count-type outcomes derive from the `stats`
package of the R core distribution [@rcore2018]. Hyperparameters for these
models include linkage functions and are shown in the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size |`sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| linking function |`family` | `gaussian`, `log_gaussian`, `inv_gaussian`, `poisson`, `log_poisson` | `glm` only | The linking function is not optimised when it is specified, e.g. `glm_poisson`. `gaussian`, `log_gaussian`, `inv_gaussian` linking functions are not available for count-type outcomes.|

### Linear models for survival outcomes

Linear models for survival outcomes are divided into semi-parametric and
parametric models. The semi-parametric Cox proportional hazards model
[@Cox1972-fc] is based on the `survival::coxph` function [@Therneau2000-jv].
Tied survival times are resolved using the default method by @Efron1977-ww.

Various fully parametric models are also available, which differ in the assumed
distribution of the outcome, i.e. the linking function. The parametric models
are all based on `survival::survreg` [@Therneau2000-jv].

Hyperparameters for semi-parametric and parametric survival models are shown in
the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size |`sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| linking function |`family` | `weibull`, `exponential`, `gaussian`, `logistic`, `lognormal`, `loglogistic` | `survival_regr` only | The linking function is not optimised when it is specified, e.g. `survival_regr_weibull`. The non-parametric `cox` learner does not have a linking function.|

## Lasso, ridge and elastic net regression

Generalised linear models can be problematic as there is no inherent limit to
model complexity, which can easily lead to overfitting. Penalised regression, or
shrinkage, methods address this issue by penalising model complexity.

Three shrinkage methods are implemented in `familiar`, namely ridge regression,
lasso and elastic net, which are all implemented using the `glmnet` package
[@Hastie2009-ed; @Simon2011-ih]. The set of hyperparameters is shown in the
table below. The optimal `lambda` parameter is determined by cross-validation as
part of the `glmnet::cv.glmnet` function, and is not directly determined using
hyperparameter optimisation.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size |`sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| elastic net penalty | `alpha` | $\mathbb{R} \in \left[0,1\right]$ | elastic net | This penalty is fixed for ridge regression (`alpha = 0`) and lasso (`alpha = 1`). |
| optimal lambda | `lambda_min` | `lambda.1se`, `lambda.min` | no | Default is `lambda.min`.|
| number of CV folds | `n_folds` | $\mathbb{Z} \in \left[3,n\right]$ | no | Default is $3$ if $n<30$, $\lfloor n/10\rfloor$ if $30\leq n \leq 200$ and $20$ if $n>200$.|
| sample weighting | `sample_weighting` | `inverse_number_of_samples`, `effective_number_of_samples`, `none` | no | Sample weighting allows for mitigating class imbalances. The default is `inverse_number_of_samples` for `binomial` and `multinomial` outcomes, and `none` otherwise. Instances with the majority class receive less weight. |
| beta | `sample_weighting_beta` | $\mathbb{Z} \in \left[-6,-1\right]$ | `effective_number_of_samples` only | Specifies the $\beta$ parameter for effective number of samples weighting [@Cui2019-ys]. It is expressed on the $\log_{10}$ scale: $\beta=1-10^\texttt{beta}$. |
| normalisation | `normalise` | `FALSE`, `TRUE` | no | Default is `FALSE`, as normalisation is part of pre-processing in familiar.|

## Boosted generalised linear models and regression trees

Boosting is a procedure which combines many weak learners to form a more
powerful panel [@Schapire1990-vn; @Hastie2009-ed]. Boosting learners are
implemented using the `mboost` package [@Buhlmann2007-ku; @Hothorn2010-cu;
@Hofner2015-tt]. Both linear regression (`mboost::glmboost`) and regression
trees (`mboost::blackboost`) are implemented. The hyperparameters are shown in
the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size | `sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| family | `family` | `logistic`, `probit`, `bin_loglog`, `cauchy`, `log`, `auc`, `gaussian`, `huber`, `laplace`, `poisson`, `cox`, `weibull`, `lognormal`, `surv_loglog`, `gehan`, `cindex` | general tags | The family is not optimised when it is specified, e.g. `boosted_tree_gaussian`.|
| boosting iterations | `n_boost` | $\mathbb{R} \in \left[0,\infty\right)$ | yes | This parameter is expressed on the $\log_{10}$ scale, i.e. the actual input value will be $10^\texttt{n_boost}$. The default range is $\left[0, 3\right]$. |
| learning rate | `learning_rate` | $\mathbb{R} \in \left[-\infty,0\right]$ | yes | This parameter is expressed on the $\log_{10}$ scale. The default range is $\left[-5,0\right]$.|
| sample weighting | `sample_weighting` | `inverse_number_of_samples`, `effective_number_of_samples`, `none` | no | Sample weighting allows for mitigating class imbalances. The default is `inverse_number_of_samples` for `binomial` and `multinomial` outcomes, and `none` otherwise. Instances with the majority class receive less weight. |
| beta | `sample_weighting_beta` | $\mathbb{Z} \in \left[-6,-1\right]$ | `effective_number_of_samples` only | Specifies the $\beta$ parameter for effective number of samples weighting [@Cui2019-ys]. It is expressed on the $\log_{10}$ scale: $\beta=1-10^\texttt{beta}$. |
| maximum tree depth | `tree_depth` | $\mathbb{Z} \in \left[1,n\right]$ | trees only | Maximum depth to which trees are allowed to grow.|
| minimum sum of instance weight | `min_child_weight` | $\mathbb{R} \in \left[0,\infty\right)$ | trees only | Minimal instance weight required for further branch partitioning, or the number of instances required in each node. This parameter is expressed on the $\log_{10}$ scale with a $-1$ offset. The default range is $\left[0, 2\right]$.|
| significance split threshold | `alpha` | $\mathbb{R} \in \left(0.0, 1.0\right]$ | trees only | Minimum significance level for further splitting. The default range is $\left[10^{-6}, 1.0\right]$ |

Optimising the number of boosting iterations may be slow as models with a large
number of boosting iterations take longer to learn and assess. If this is
an issue, the `n_boost` parameter should be set to a smaller range, or provided
with a fixed value.

Low learning rates may require an increased number of boosting iterations.

Also note that hyperparameter optimisation time depends on the type of family
chosen, e.g. `cindex` is considerably slower than `cox`.

## Extreme gradient boosted linear models and trees

Boosting is a procedure which combines many weak learners to form a more
powerful panel [@Schapire1990-vn; @Hastie2009-ed]. Extreme gradient boosting is
a gradient boosting implementation that was highly successful in several machine
learning competitions. Learners are implemented using the `xgboost` package
[@Chen2016-lo]. Three types are implemented: regression based boosting
(`xgboost_lm`), regression-tree based boosting (`xgboost_tree`) and
regression-tree based boosting with drop-outs (`xgboost_dart`).

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size | `sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| family | `learn_objective` | `gaussian,` `continuous_logistic`, `multinomial_logistic`, `binomial_logistic`, `poisson`, `gamma`, `cox` | general tags | The family is not optimised when it is specified, e.g. `xgboost_lm_linear`.|
| boosting iterations | `n_boost` | $\mathbb{R} \in \left[0,\infty \right)$ | yes | This parameter is expressed on the $\log_{10}$ scale, i.e. the actual value will be $10^\texttt{n_boost}$. The default range is  $\left[0, 3 \right]$.|
| learning rate | `learning_rate` | $\mathbb{R} \in \left(-\infty,0\right]$ | yes | This parameter is expressed on the $\log_{10}$ scale. The default range is $\left[-5, 0 \right]$.|
| L1 regularisation | `alpha` | $\mathbb{R} \in \left[-6, \infty\right)$ | yes | This parameter is expressed on the $\log_{10}$ scale with a $10^{-6}$ offset. The default range is $\left[-6, 3\right]$.|
| L2 regularisation | `lambda` | $\mathbb{R} \in \left[-6, \infty\right)$ | yes | This parameter is expressed on the $\log_{10}$ scale with a $10^{-6}$ offset. The default range is $\left[-6, 3\right]$.|
| sample weighting | `sample_weighting` | `inverse_number_of_samples`, `effective_number_of_samples`, `none` | no | Sample weighting allows for mitigating class imbalances. The default is `inverse_number_of_samples` for `binomial` and `multinomial` outcomes, and `none` otherwise. Instances with the majority class receive less weight. |
| beta | `sample_weighting_beta` | $\mathbb{Z} \in \left[-6,-1\right]$ | `effective_number_of_samples` only | Specifies the $\beta$ parameter for effective number of samples weighting [@Cui2019-ys]. It is expressed on the $\log_{10}$ scale: $\beta=1-10^\texttt{beta}$. |
| maximum tree depth | `tree_depth` | $\mathbb{Z} \in \left[1,\infty\right)$ | `gbtree`, `dart` | Maximum depth to which trees are allowed to grow. The default range is $\left[1, 10\right]$. |
| subsampling fraction | `sample_size` | $\mathbb{R} \in \left(0, 1.0\right]$ | `gbtree`, `dart` | Fraction of available data that is used for to create a single tree. The default range is $\left[2 / m, 1.0\right]$, with $m$ the number of samples. |
| minimum sum of instance weight | `min_child_weight` | $\mathbb{R} \in \left[0,\infty\right)$ | `gbtree`, `dart` | Minimal instance weight required for further branch partitioning, or the number of instances required in each node. This parameter is expressed on the $\log_{10}$ scale with a $-1$ offset. The default range is $\left[0, 2\right]$.|
| min. splitting error reduction | `gamma` | $\mathbb{R} \in \left[-6, \infty\right)$ | `gbtree`, `dart` | Minimum error reduction required to allow splitting. This parameter is expressed on the $\log_{10}$ scale with a $10^{-6}$ offset. The default range is $\left[-6, 3\right]$. `continuous` and `count`-type outcomes are normalised to the $\left[0, 1\right]$ range prior to model fitting to deal with scaling issues. Values are converted back to the original scale after prediction. |
| DART sampling algorithm | `sample_type` | `uniform`, `weighted` | `dart` | -- |
| drop-out rate | `rate_drop` | $\mathbb{R} \in \left[0,1\right)$ | `dart` | -- |

## Random forest (RFSRC)

Random forests (explain) [@Breiman2001-ao]. The random forest learner is
implemented through the `randomForestSRC` package, which provides a unified
interface for different types of forests [@Ishwaran2008-hz; @Ishwaran2011-gu].

An outcome variable that represents count-type data is first transformed using a
$\log(x+1)$ transformation. Predicted responses are then transformed to the
original scale using the inverse $\exp(x)-1$ transformation.

Hyperparameters for random forest learners are shown in the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size | `sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| number of trees | `n_tree` | $\mathbb{Z} \in \left[0,\infty\right)$ | yes |  This parameter is expressed on the $\log_{2}$ scale, i.e. the actual input value will be $2^\texttt{n_tree}$ [@Oshiro2012-mq]. The default range is $\left[4, 10\right]$.|
| subsampling fraction | `sample_size` | $\mathbb{R} \in \left(0, 1.0\right]$ | yes | Fraction of available data that is used for to create a single tree. The default range is $\left[2 / m, 1.0\right]$, with $m$ the number of samples. |
| number of features at each node | `m_try` | $\mathbb{R} \in \left[0.0, 1.0\right]$ | yes | Familiar ensures that there is always at least one candidate feature. |
| node size | `node_size` | $\mathbb{Z} \in \left[1, \infty\right)$ | yes | Minimum number of unique samples in terminal nodes. The default range is $\left[5, \lfloor m / 3\rfloor\right]$, with $m$ the number of samples.|
| maximum tree depth | `tree_depth` | $\mathbb{Z} \in \left[1,\infty\right)$ | yes | Maximum depth to which trees are allowed to grow. The default range is $\left[1, 10\right]$. |
| number of split points | `n_split` | $\mathbb{Z} \in \left[0, \infty\right)$ | no | By default, splitting is deterministic and has one split point ($0$).|
| splitting rule | `split_rule` | `gini`, `auc`, `entropy`, `mse`, `quantile.regr`, `la.quantile.regr`, `logrank`, `logrankscore`, `bs.gradient` | no | Default splitting rules are `gini` for `binomial` and `multinonial` outcomes, `mse` for `continuous` and `count` outcomes and `logrank` for `survival` outcomes.|
| sample weighting | `sample_weighting` | `inverse_number_of_samples`, `effective_number_of_samples`, `none` | no | Sample weighting allows for mitigating class imbalances. The default is `inverse_number_of_samples` for `binomial` and `multinomial` outcomes, and `none` otherwise. Instances with the majority class receive less weight. |
| beta | `sample_weighting_beta` | $\mathbb{Z} \in \left[-6,-1\right]$ | `effective_number_of_samples` only | Specifies the $\beta$ parameter for effective number of samples weighting [@Cui2019-ys]. It is expressed on the $\log_{10}$ scale: $\beta=1-10^\texttt{beta}$. |

Note that optimising that optimising the number of trees can be slow, as big
forests take longer to construct and perform more computations for predictions.
Hence hyperparameter optimisation may be sped up by limiting the range of the
`n_tree` parameter, or setting a single value.

## Random forest (ranger)

The second implementation of random forests comes from the `ranger` package
[@Wright2017-wc]. It is generally faster than the implementation in
`randomForestSRC` and allows for different splitting rules, such as maximally
selected rank statistics [@Lausen1992-qh; @Wright2017-sj] and concordance
index-based splitting [@Schmid2016-ie]. Hyperparameters for the random forest
are shown in the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size | `sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| number of trees | `n_tree` | $\mathbb{Z} \in \left[0,\infty\right)$ | yes |  This parameter is expressed on the $\log_{2}$ scale, i.e. the actual input value will be $2^\texttt{n_tree}$ [@Oshiro2012-mq]. The default range is $\left[4, 10\right]$.|
| subsampling fraction | `sample_size` | $\mathbb{R} \in \left(0, 1.0\right]$ | yes | Fraction of available data that is used for to create a single tree. The default range is $\left[2 / m, 1.0\right]$, with $m$ the number of samples. |
| number of features at each node | `m_try` | $\mathbb{R} \in \left[0.0, 1.0\right]$ | yes | Familiar ensures that there is always at least one candidate feature. |
| node size | `node_size` | $\mathbb{Z} \in \left[1, \infty\right)$ | yes | Minimum number of unique samples in terminal nodes. The default range is $\left[5, \lfloor m / 3\rfloor\right]$, with $m$ the number of samples.|
| maximum tree depth | `tree_depth` | $\mathbb{Z} \in \left[1,\infty\right)$ | yes | Maximum depth to which trees are allowed to grow. The default range is $\left[1, 10\right]$. |
| splitting rule | `split_rule` | `gini`, `extratrees`, `variance`, `logrank`, `C`, `maxstat`| no | Default splitting rules are `gini` for `binomial` and `multinomial` outcomes and `maxstat` for `continuous`, `count` and `survival` outcomes.|
| significance split threshold | `alpha` | $\mathbb{R} \in \left(0.0, 1.0\right]$ | `maxstat` | Minimum significance level for further splitting. The default range is $\left[10^{-6}, 1.0\right]$ |
| sample weighting | `sample_weighting` | `inverse_number_of_samples`, `effective_number_of_samples`, `none` | no | Sample weighting allows for mitigating class imbalances. The default is `inverse_number_of_samples` for `binomial` and `multinomial` outcomes, and `none` otherwise. Instances with the majority class receive less weight. |
| beta | `sample_weighting_beta` | $\mathbb{Z} \in \left[-6,-1\right]$ | `effective_number_of_samples` only | Specifies the $\beta$ parameter for effective number of samples weighting [@Cui2019-ys]. It is expressed on the $\log_{10}$ scale: $\beta=1-10^\texttt{beta}$. |

Note that optimising the number of trees can be slow, as big forests take longer
to construct and perform more computations for predictions. Hence hyperparameter
optimisation may be sped up by limiting the range of the `n_tree` parameter or
setting a single value.

## Naive Bayes

The naive Bayes classifier uses Bayes rule to predict posterior probabilities.
The naive Bayes classifier uses the `e1071::naiveBayes` function
[@Meyer2021-aq]. The  hyperparameters of the classifier are shown in the table
below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size |`sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| laplace smoothing | `laplace` | $\mathbb{R} \in \left[0.0, \infty \right)$ | yes | The default range is $\left[0, 10\right]$. |

## *k*-nearest neighbours

*k*-nearest neighbours is a simple clustering algorithm that classifies samples
based on the classes of their *k* nearest neighbours in feature space. The
`e1071::gknn` function is used to implement *k*-nearest neighbours
[@Meyer2021-aq]. The hyperparameters are shown in the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size |`sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| number of nearest neighbours | `k` | $\mathbb{Z} \in \left[1,m\right]$ | yes | The default range is $\left[1, \lceil 2 m^{1/3} \rceil \right]$, with $m$ the number of samples. |
| distance metric | `distance_metric` | all metrics supported by `proxy::dist` | yes | The default set of metrics is `gower`, `euclidean` and `manhattan`.|

## Support vector machines

Support vector machines were originally defined to find optimal margins
between classes for classification problems. Support vector machines were
popularized after the use of the kernel trick was described. Using the kernel
trick the calculations are performed in an implicit high-dimensional feature
space, which is considerably more efficient than explicit calculations
[@Boser1992-nk].

Familiar implements SVM using `e1071::svm`. We tried `kernlab::ksvm`, but this
function would reproducibly freeze during unit testing. By default, both
features and outcome data are scaled internally. This was used to derive the
default hyperparameter ranges specified in the table below.

| **parameter** | **tag** | **values** | **optimised** | **comments** |
|:--------------|:--------|:----------:|:----------:|:-------------|
| signature size | `sign_size` | $\mathbb{Z} \in \left[1,n\right]$ | yes | -- |
| SVM kernel | `kernel` | `linear`, `polynomial`, `radial`, `sigmoid` | no | Default is `radial`, unless specified as part of the learner's name. |
| $C$ | `c` | $\mathbb{R}$ | yes | The cost for violation of constraints is expressed on a $\log_{10}$ scale, i.e. the actual value is $10^\texttt{c}$. The default range is $\left[-5, 3\right]$. |
| $\epsilon$ | `epsilon` | $\mathbb{R}$ | `continuous` and `count` outcomes | The error tolerance $\epsilon$ is only used for regression SVM. $\epsilon$ is expressed on a $\log_{10}$ scale. The default range is $\left[-5, 1\right]$. |
| $\nu$ | `nu` | $\mathbb{R}$ | `nu` SVM type | `nu` provides upper bounds for the fraction of training errors and lower bounds for the fraction of support vectors [@Chang2011-ck]. It is expressed on a $\log_{10}$ scale. The default range is $\left[-5, 1\right]$. |
| inverse kernel width | `gamma` | $\mathbb{R}$ | non-linear kernels | This parameter specifies the inverse of the kernel width. It is expressed on the $\log_{10}$ scale. The default range is $\left[-9, 3\right]$. |
| polynomial degree | `degree` | $\mathbb{Z} \in \left[1, \infty\right)$ | polynomial kernel | The default range is $\left[1, 5\right]$. |
| kernel offset | `offset` | $\mathbb{R} \in \left[0, \infty\right)$ | polynomial and sigmoid kernels |  Negative values are not allowed. The default range is $\left[0, 1 \right]$. |

Several types of SVM algorithms exist for classification and regression. The
following SVM types are implemented:

* $C$-classification: `c`
* $\nu$-classification: `nu`
* $\nu$-regression: `nu`
* $\epsilon$-regression: `eps`

The following kernels are implemented:

* Linear kernel: `linear`
* Polynomial kernel: `polynomial`
* Radial basis function kernel: `radial`
* Hyperbolic tangent (sigmoid) kernel: `sigmoid`


# Hyperparameter optimization

Hyperparameter optimisation is conducted to select model parameters that are
more likely to lead to generalisable results. The main hyperparameter
optimisation framework used by familiar is based on sequential model-based
optimisation (SMBO) [@Hutter2011-ea], but with significant updates and
extensions.

Overall, the following steps are conducted to optimise hyperparameters for a
given learner and dataset:

- A set of bootstraps of the development dataset is made. This allows for
training the models using the in-bag data and assessing performance using
out-of-bag data. This is independent of any bootstraps that may have been
defined as part of the experimental design. In case a subsampling method is
specified in the experimental design, the corresponding internal development
dataset is bootstrapped.

- Variable importance is determined for each of the in-bag datasets to avoid
positively biasing the optimisation process through information leakage.

- An initial exploration of the hyperparameter space is conducted. A model is
trained using each hyperparameter set and a bootstrap subsample. It is then
assessed by computing an optimisation score based on model performance in the
out-of-bag, and optionally, in-bag data.

- Exploration of the hyperparameter space then continues by iteratively
comparing the best known hyperparameter set against challenger hyperparameter
sets in so-called intensify steps. The challengers are selected by predicting
the expected model performance for a new hyperparameter set and computing its
utility [@Shahriari2016-kx]. In addition, we predict model run times, and
actively prune potential slow challengers. Utility is computed using an
acquisition function, of which several are available in familiar. The set of
challenger hyperparameter sets may moreover be iteratively pruned, dependent on
the exploration method.

- Exploration stops if no significant improvement to model performance could be
established by further exploring the hyperparameter space, the total number of
iterations has been reached, or bootstraps samples have been exhausted.

- If the best set of hyperparameters does not lead to a model that exceeds
performance of a naive model, e.g. one that always predicts the majority class
or median value, a naive model is trained instead.

## Predicting run time of model

Models take a certain time to train. Familiar is actively measuring this time
during hyperparameter optimisation for two reasons: first, to optimise
assignment of jobs to parallel nodes; and secondly, to prune potential challenger
sets that produce models that tend to run longer than the best-known model. The
strictness increases with increased exploration.

Let $m$ be the number of bootstraps used to quantify the most visited
hyperparameter sets, and $M$ the total number of bootstrap samples that can be
visited. Then let $t_{\texttt{opt}}$ be the runtime of the best known model. The
maximum time that a challenger hyperparameter set is allowed for training is
then empirically set to:

$$t_{\texttt{max}} = \left(5 - 4 \frac{m}{M} \right) t_{\texttt{opt}} $$

Hence, $t_{\texttt{max}}$ converges to $t_{\texttt{opt}}$ for $m \rightarrow M$.

If maximum runtime is relatively insignificant, i.e. $t_{\texttt{max}} < 10.0$
seconds, a threshold of $t_{\texttt{max}}= 10.0$ is used.

If the maximum runtime is not known, i.e. none of the hyperparameter sets
evaluated so far produced a valid model, the maximum time threshold is set to
infinite. In effect, no hyperparameter sets are pruned based on expected run
time.

A random forest is trained to infer runtimes for (new) hyperparameter sets,
based on the runtimes observed for visited hyperparameter sets. The random
forest subsequently infers runtime for a challenger hyperparameter set. The
runtime estimate is compared against $t_{\texttt{max}}$, and if it exceeds the
threshold, it is rejected and not evaluated.

## Assessing goodness of hyperparameter sets

A summary score $s$ determines how good a set of hyperparameters $\mathbf{x}$
is. This score is computed in two steps. First an objective score is computed to
assess model performance for a hyperparameter set in a bootstrap subsample.
Subsequently an optimisation score is computed within the subsample. Finally a
summary score is determined from the optimisation score over all subsamples.

An objective score $s''$ is computed from the performance metric value for a
specific hyperparameter set for in-bag (IB; $s''_{\textrm{IB}}$) and out-of-bag
data (OOB; $s''_{\textrm{OOB}}$). The objective score always lies in the interval
$[-1.0, 1.0]$. An objective score of $1.0$ always indicates the best possible
score. A score of $0.0$ indicates that the hyperparameter set leads to the same
degree of performance as the best educated guess, i.e. the majority class (for
binomial and multinomial outcomes), the median value (for continuous and count
outcomes), or tied risk or survival times (for survival outcomes). Objective
scores that are either missing or below $-1.0$ are truncated to the value of
$-1.0$. In case multiple metrics are used to assess model performance, the mean
of the respective objective scores is used.

The optimisation scores $s'$ and summary score $s$ of each hyperparameter set
are subsequently computed using one of the following functions, which can be set
using the `optimisation_function` parameter:

1. `validation`, `max_validation` (default): The optimisation score for each
bootstrap subsample is $s'=s''_{\textrm{OOB}}$. The summary score $s$ is then
the average of optimisation scores $s'$. This is a commonly used criterion that
tries to maximise the score on the OOB validation data.

2. `balanced`:  The optimisation score for each bootstrap subsample is
$s'=s''_{\textrm{OOB}} - \left| s''_{\textrm{OOB}} - s''_{\textrm{IB}} \right|$.
The summary score $s$ is then the average of optimisation scores $s'$. A
variation on `max_validation` with a penalty for differences in performance
between the IB and OOB data. The underlying idea is that a good set of
hyperparameters should lead to models that perform well on both development and
validation data.

3. `strong_balance`: The optimisation score for each bootstrap subsample is
$s'=s''_{\textrm{OOB}} - 2\left| s''_{\textrm{OOB}}-s''_{\textrm{IB}} \right|$.
The summary score $s$ is then the average of optimisation scores $s'$. A variant
of `balanced` with a stronger penalty term.

4. `validation_minus_sd`: The optimisation score for each bootstrap subsample is
$s'=s''_{\textrm{OOB}}$. The summary score $s$ is then the average of
optimisation scores $s'$ minus the standard deviation of $s'$. This penalises
hyperparameter sets that lead to wide variance on OOB data.

5. `validation_25th_percentile`: The optimisation score for each bootstrap
subsample is $s'=s''_{\textrm{OOB}}$. The summary score $s$ is the 25th
percentile of optimisation scores $s'$. Like `validation_minus_sd` this
penalises hyperparameter sets that lead to wide variance on OOB data.

6. `model_estimate`: The optimisation score for each bootstrap subsample is
$s'=s''_{\textrm{OOB}}$. The model used to infer utility of new hyperparameter
sets (see the next section [Predicting optimisation score for new hyperparameter
sets]) is used to predict the expected optimisation score $\mu(\mathbf{x})$.
This score is used as summary score $s$. Not available for random search.

7. `model_estimate_minus_sd`: The optimisation score for each bootstrap
subsample is $s'=s''_{\textrm{OOB}}$. The model used to infer utility of new
hyperparameter sets (see the next section [Predicting optimisation score for new
hyperparameter sets]) is used to predict the expected optimisation score
$\mu(\mathbf{x})$ and its standard deviation $\sigma(\mathbf{x})$. Then
$s = \mu(\mathbf{x}) - \sigma(\mathbf{x})$. This penalises hyperparameter sets
for which a large variance is expected. Not available for random search.

8. `model_balanced_estimate`: The optimisation score for each bootstrap
subsample is $s'=s''_{\textrm{OOB}} - \left| s''_{\textrm{OOB}} - s''_{\textrm{IB}} \right|$.
The model used to infer utility of new hyperparameter sets (see the next section
[Predicting optimisation score for new hyperparameter sets]) is used to predict
the expected optimisation score $\mu(\mathbf{x})$. This score is used as summary
score $s$. The method is similar to `model_estimate`, but predicts the balanced
score instead of the validation score. Not available for random search.

9. `model_balanced_estimate_minus_sd`: The optimisation score for each bootstrap subsample is
$s'=s''_{\textrm{OOB}} - \left| s''_{\textrm{OOB}} - s''_{\textrm{IB}} \right|$. 
The model used to infer utility of new hyperparameter sets (see the next section
[Predicting optimisation score for new hyperparameter sets]) is used to predict
the expected optimisation score $\mu(\mathbf{x})$ and its standard deviation
$\sigma(\mathbf{x})$. Then $s = \mu(\mathbf{x}) - \sigma(\mathbf{x})$. This
penalises hyperparameter sets for which a large variance is expected. The method
is similar to `model_estimate_minus_sd`, but predicts the balanced score and its
variance instead. Not available for random search.

The summary score is then used to select the best known hyperparameter set, i.e.
the set that maximises the summary score. Moreover, the optimisation scores are
used to identify candidate hyperparameter sets as described in the following
section. The optimisation function is set using the `optimisation_function`
argument.

## Predicting optimisation score for new hyperparameter sets

Any point in the hyperparameter space has a single, scalar, optimisation
score value that is *a priori* unknown. During the optimisation process,
the algorithm samples from the hyperparameter space by selecting
hyperparameter sets and computing the optimisation score for one or
more bootstraps. For each hyperparameter set the resulting scores are
distributed around the actual value.

A key point of Bayesian optimisation is the ability to estimate the usefulness
or utility of new hyperparameter sets. This ability is facilitated by modelling
the optimisation score of new hyperparameter sets using the optimisation scores
of observed hyperparameter sets.

The following models can be used for this purpose:

- `gaussian_process` (default): Creates a localised approximate Gaussian
deterministic Gaussian Process. This is implemented using the `laGP` package
[@Gramacy2016-aa].

- `bayesian_additive_regression_trees` or `bart`: Uses Bayesian Additive
Regression Trees for inference. Unlike standard random forests, BART allows for
estimating posterior distributions directly and can extrapolate. BART is
implemented using the `BART` package [@Sparapani2021-aa].

- `random_forest`: Creates a random forest for inference. A random forest was
originally used in the SMBO algorithm by Hutter et al. [@Hutter2011-ea]. A
weakness of random forests is their lack of extrapolation beyond observed
values, which limits their usefulness in exploiting promising areas of
hyperparameter space somewhat.

In addition, familiar can perform random search (`random` or `random_search`).
This forgoes the use of models to steer optimisation. Instead, the
hyperparameter space is sampled at random.

The learner used to predict optimisation scores can be specified using the
`hyperparameter_learner` parameter.

## Acquisition functions for utility of hyperparameter sets

The expected values and the posterior values of optimisation scores from the
models are used to compute utility of new hyperparameter sets using an
acquisition function. The following acquisition functions are available in
familiar [@Shahriari2016-kx]. Let $\nu=f(\mathbf{x})$ be the posterior
distribution of hyperparameter set $\mathbf{x}$, and $\tau$ the best observed
optimisation score. Let also $\mu(\mathbf{x})$ and $\sigma(\mathbf{x})$ be the
sample mean and sample standard deviation for set $\mathbf{x}$ (implicitly for
round $m$):

* `improvement_probability`: The probability of improvement quantifies the
  probability that the expected optimisation score for a set $\mathbf{x}$ is
  better than the best observed optimisation score $\tau$:

  $$\alpha(\mathbf{x}) = P(\nu > \tau) = \Phi \left(\frac{\mu(\mathbf{x}) - \tau} {\sigma(\mathbf{x})}\right)$$

  Here $\Phi$ is the cumulative distribution function of a normal distribution.
  Note that this acquisition function is typically prone to convergence to local
  minima, although in our algorithm this may be mitigated by the alternating
  random search strategy.

* `improvement_empirical_probability`: Similar to `improvement_probability`, but based
  directly on optimisation scores predicted by the individual decision trees:

  $$\alpha(\mathbf{x}) = P(\nu > \tau) = \frac{1}{N} \sum_{i=1}^N \left[\nu_i > \tau\right]$$

  with $\left[\ldots\right]$ denoting an Iverson bracket, which takes the value
  1 if the condition specified by $\ldots$ is true, and 0 otherwise.

* `expected_improvement` (default): Expected improvement is based on an improvement
  function, and is be computed as follows:

  $$\alpha(\mathbf{x})=(\mu(\mathbf{x})-\tau) \Phi \left(\frac{\mu(\mathbf{x}) - \tau}{\sigma(\mathbf{x})}\right) + \sigma(\mathbf{x}) \phi\left(\frac{\mu(\mathbf{x}) - \tau}{\sigma(\mathbf{x})}\right)$$

  with $\phi$ denoting the normal probability density function. If $\sigma(\mathbf{x}) = 0$,
  $\alpha(\mathbf{x})=0$.

* `upper_confidence_bound`: This acquisition function is based on the upper
  confidence bound of the distribution, and is defined as [@Srinivas2012-hq]:

  $$\alpha(\mathbf{x}, t)=\mu(\mathbf{x}) + \beta_t \sigma(\mathbf{x})$$

  Here the parameter $\beta_t$ depends on the current round $t$, which in our
  implementation would be roughly equivalent to the maximum number of bootstraps
  performed for any of the hyperparameter sets, minus $1$. The $\beta_t$ parameter
  is then defined as $\beta_t= 2 \log(|D|(t+1)^2\pi^2/6\delta)$
  [@Srinivas2012-hq]. Here, $\delta \in [0,1]$ and $|D|$ the dimensionality of the
  hyperparameter space. For their experiments Srinivas et al. used $\delta=0.1$,
  and noted that scaling down $\beta_t$ by a factor $5$ improves the algorithm.
  Hence, we use $\beta_t= \frac{2}{5} \log\left(\frac{5}{3} |D|(t+1)^2\pi^2\right)$.

* `bayes_upper_confidence_bound`: Proposed by Kaufmann et al. [@Kaufmann2012-kh],
  the Bayesian upper confidence bound is defined as:

  $$\alpha(\mathbf{x}, t)=Q \left(1-\frac{1}{(t + 1) \log(n)^c}, \nu_t\right)$$

  with $Q$ being a quantile function. Note that we here deviate from the original
  definition by substituting the original $t$ by $t+1$, and not constraining the
  posterior distribution $\nu$ to a known function. In the original, $n$ defines
  the number of rounds, with $t$ the current round. In our algorithm, this is
  interpreted as $n$ being the total number of bootstraps, and $t$ the maximum
  number of bootstraps already performed for any of the hyperparameter sets.
  Hence our implementation is the same, aside from the lack of constraint on the
  posterior distribution. Based on the simulations and recommendations by
  Kaufmann et al. we choose parameter $c=0$.

The acquisition function can be specified using the `acquisition_function`
parameter.

## Exploring challenger sets

After selecting challenger hyperparameter sets, the optimisation score of models
trained using these hyperparameter sets are compared against the best-known
score in a run-off intensify step. During this step, models will be trained on
new bootstrap data, compared, and then trained on another set of bootstraps, and
so on. Pruning the set of challenger hyperparameters after each iteration
reduces computational load. Familiar implements the following methods:

- `single_shot` (default): The set of alternative parameter sets is not pruned,
and each intensify iteration contains only a single (initial) intensification
step that only uses a single bootstrap. This is the exploration method that
under default settings does the least in-depth exploration, with the advantage
of being fast. Extensive internal tests show that this exploration method yields
similarly performant models as more extensive exploration methods.

- `successive_halving`: The set of alternative parameter sets is pruned by
removing the worst performing half of the sets after each step
[@Jamieson2016-fq]. The set of investigated parameter sets gets progressively
smaller.

- `stochastic_reject`: The set of alternative parameter sets is pruned by
comparing the performance of each parameter set with that of the incumbent best
parameter set using a paired Wilcoxon test.

- `none`: The set of alternative parameter sets is not pruned. Compared to
`single_shot`, hyperparameter sets are assessed using multiple bootstraps
(`smbo_step_bootstraps` * `smbo_intensify_steps`).

The method used to steer exploration can be set using the `exploration_method`
parameter.

## Providing hyperparameters manually

It is possible to set hyperparameters manually. This can be used to change
hyperparameters that are fixed by default, to set a fixed value for randomised
hyperparameters, or to provide a different search range for randomised
hyperparameters.

Hyperparameters can be provided using the `hyperparameter` tag. For the
`glm_logistic` learner an example tag may for example look as follows:
```
<hyperparameter>
  <glm_logistic>
    <sign_size>5</sign_size>
  </glm_logistic>
</hyperparameter>
```

Or as a nested list passed as the `hyperparameter` argument to `summon_familiar`:
```
hyperparameter = list("glm_logistic"=list("sign_size"=5))
```

More than one value can be provided. The behaviour changes depending on whether
the hyperparameter is categorical or numeric variable. In case of categorical
hyperparameters, the provided values define the search range. For numerical
hyperparameters, providing two values sets the bounds of the search range,
whereas providing more than two values will define the search range itself.

## Configuration options for hyperparameter optimisation

Hyperparameter optimization may be configured using the tags/arguments in the table below.

| **tag** / **argument** | **description** | **default** |
|:----------:|:-------------------------|:-----------:|
| `optimisation_bootstraps`  | Maximum number of bootstraps created for optimisation. | `20` |
| `optimisation_determine_vimp` | If `TRUE`, compute variable importance for each bootstrap. If `FALSE` use variable importance computed during the feature selection step. | `TRUE` |
| `smbo_random_initialisation` | If `random` initial parameter sets are generated randomly from default ranges. If `fixed` or `fixed_subsample`, the initial parameter sets are based on a grid in parameter space. | `fixed_subsample` |
| `smbo_n_random_sets` | Sets the number of hyperparameter sets drawn for initialisation. Ignored if `smbo_random_initialisation` is `fixed`. If `smbo_random_initialisation` is `fixed_subsample`, the number of selected hyperparameters may be lower. | `100` |
| `max_smbo_iterations` | Maximum number of intensify iterations of the SMBO algorithm. | `20` |
| `smbo_stop_convergent_iterations` | Number of subsequent convergent SMBO iterations required to stop hyperparameter optimisation early. | `3` |
| `smbo_stop_tolerance` | Tolerance for recent optimisation scores to determine convergence. | `0.1 * 1 / sqrt(n_samples)` |
| `smbo_time_limit` | Time limit (in minutes) for the optimisation process. | `NULL` |
| `smbo_initial_bootstraps` | Number of bootstraps assessed initially. | `1` |
| `smbo_step_bootstraps` | Number of bootstraps used within each step of an intensify iteration. | `3` |
| `smbo_intensify_steps` | Number of intensify steps within each intensify iteration. | `5` |
| `optimisation_metric` | The metric used for optimisation, e.g. `auc`. See the vignette on performance metrics for available options. More than one metric may be specified. | `auc_roc` (`binomial`, `multinomial`); `mse` (`continuous`); `msle` (`count`); and `concordance_index` (`survival`) |
| `optimisation_function` | The optimisation function used (see [Assessing goodness of hyperparameter sets]). | `balanced` |
| `hyperparameter_learner` | Learner used to predict optimisation scores for new hyperparameter sets (see [Predicting optimisation score for new hyperparameter sets]). | `gaussian_process` |
| `acquisition_function` | The function used to quantify utility of hyperparameter sets (see [Acquisition functions for utility of hyperparameter sets]). | `expected_improvement` |
| `exploration_method` | Method used to explore challenger hyperparameter sets (see [Exploring challenger sets]). | `single_shot` |
| `smbo_stochastic_reject_p_value` | The p-value level for stochastic pruning. | `0.05` |
| `parallel_hyperparameter_optimisation` | Enables parallel processing for hyperparameter optimisation. Ignored if `parallel=FALSE`. Can be `outer` to process multiple optimisation processes simultaneously. | `TRUE` |

Table: Configuration options for hyperparameter optimisation.

# Model recalibration

Even though learners may be good at discrimination, model calibration can be
lacklustre. Some learners are therefore recalibrated as follows:

1.  3-fold cross-validation is performed. Responses are predicted using a new
model trained on each training fold.

2.  Three recalibration models are trained using the responses for the training
folds as input [@Niculescu-Mizil2005-kj].

3.  The three recalibration models are then applied to (new) responses of the
full model, and the resulting values averaged for each sample.

The following learners currently undergo recalibration:

| **learner** | **outcome** | **recalibration model** |
|:------------|:-----------:|:-------------|
|`xgboost_lm`, `xgboost_lm_logistic`| binomial, multinomial| `glm_logistic`|
|`xgboost_tree`, `xgboost_tree_logistic`| binomial, multinomial| `glm_logistic`|
|`xgboost_lm_cox`| survival | `glm_cox`|
|`xgboost_tree_cox`| survival | `glm_cox`|
|`boosted_glm_cindex`, `boosted_glm_gehan`| survival | `glm_cox`|
|`boosted_tree_cindex`, `boosted_tree_gehan`| survival | `glm_cox`|

Familiar currently does not recalibrate models by inverting the model
calibration curves, but may do so in the future.

# References