Create a random forest with surrogates. — RandomForestSurrogates • RFSurrogates

The object created by this function serves as the basis for most further analysis, such as SurrogateMinimalDepth() and MeanAdjustedAgreement().

Usage

RandomForestSurrogates(
  x = NULL,
  y = NULL,
  s.pct = 0.01,
  s = ceiling(ncol(x) * ifelse(s.pct > 0 && s.pct <= 1, s.pct, 0.01)),
  mtry = c("^3/4", "sqrt", "0.5"),
  type = c("regression", "classification", "survival"),
  status = NULL,
  num.trees = 500,
  num.threads = 1,
  min.node.size = 1,
  permutate = FALSE,
  seed = NULL,
  preschedule.threads = TRUE,
  ...
)

Arguments

x, y

Predictor data and dependent variables.

s.pct, s

Number of surrogate splits. This can be defined either by setting s.pct to a number between 0 and 1, or providing an exact value for s.

s.pct: Fraction of variables to use for s, greater than 0 and at most 1. (Default: 0.01)
s: Number of surrogate splits. (Default: the number of variables multiplied by s.pct, rounded up. If s.pct is less than or equal to zero, or greater than 1, 0.01 is used instead.)
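
The default for s can be reproduced in base R to see what a given s.pct yields. This is a minimal sketch that mirrors the expression shown under Usage; default_s is a hypothetical helper, not part of the package.

```r
# Hypothetical helper mirroring the default expression for `s` from Usage.
default_s <- function(n_vars, s.pct = 0.01) {
  # Fall back to 0.01 when s.pct lies outside (0, 1].
  pct <- ifelse(s.pct > 0 && s.pct <= 1, s.pct, 0.01)
  ceiling(n_vars * pct)
}

default_s(250)        # 250 * 0.01 = 2.5, rounded up to 3
default_s(37, 0.1)    # 37 * 0.1 = 3.7, rounded up to 4
default_s(250, 2)     # invalid s.pct, falls back to 0.01, giving 3
```

Because the result is rounded up, at least one surrogate is always requested for any data set with one or more predictors.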

mtry

Number of candidate variables considered for splitting at each node. The default is the number of variables raised to the power of three quarters, rounded down (Ishwaran, 2011). Alternatively, a single-argument function that takes the number of independent variables and returns an integer.
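
To make the documented default concrete, the sketch below computes it in base R and shows what a custom single-argument function might look like. Both default_mtry and third_mtry are hypothetical names used for illustration; they are not the package's internal code.

```r
# Documented default: number of variables to the power of 3/4, rounded down.
default_mtry <- function(p) floor(p^(3 / 4))

default_mtry(100)    # 100^0.75 is about 31.6, so 31
default_mtry(1000)   # 1000^0.75 is about 177.8, so 177

# A hypothetical custom rule: one third of the variables, at least 1.
third_mtry <- function(p) max(1L, as.integer(floor(p / 3)))

third_mtry(100)      # 33
third_mtry(2)        # 1
```

A function such as third_mtry could then be supplied as mtry = third_mtry, assuming it returns an integer for the given number of predictors.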

type

The type of random forest to create with ranger. One of "regression" (Default), "classification" or "survival".

status

If type = "survival": Survival forest status variable. Use 1 for event and 0 for censoring. Length must match y.
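
A survival call might look like the sketch below. It assumes that y holds the follow-up times and that x can be a data frame of predictors; the simulated data and all variable names are illustrative, and the call only runs if RFSurrogates is installed.

```r
# Simulated survival data (illustrative only).
set.seed(42)
n <- 100
x <- data.frame(matrix(rnorm(n * 20), nrow = n))
time <- rexp(n, rate = 0.1)                  # follow-up times, assumed to be passed as y
status <- rbinom(n, size = 1, prob = 0.7)    # 1 = event, 0 = censoring

# status must have the same length as y.
stopifnot(length(status) == length(time))

if (requireNamespace("RFSurrogates", quietly = TRUE)) {
  rfs <- RFSurrogates::RandomForestSurrogates(
    x = x,
    y = time,
    status = status,
    type = "survival",
    s = 2,
    num.trees = 50,
    seed = 42
  )
}
```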

num.trees

Number of trees.

num.threads

Number of threads to parallelize with. (Default: 1)

min.node.size

Minimal node size to split at. (Default: 1)

permutate

Set to TRUE to permute x for MutualForestImpact(). (Default: FALSE)

seed

RNG seed. It is strongly recommended that you set this value.

preschedule.threads

(Default: TRUE) Passed as mc.preschedule to parallel::mclapply() in addSurrogates().

...

Arguments passed on to ranger::ranger

importance: Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see splitrule) for survival.
min.bucket: Minimal terminal node size. No nodes smaller than this value can occur. Default 3 for survival and 1 for all other tree types.
max.depth: Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree).
replace: Sample with replacement.
sample.fraction: Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement. For classification, this can be a vector of class-specific values.
case.weights: Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
class.weights: Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.
splitrule: Splitting rule. For classification and probability estimation "gini", "extratrees" or "hellinger" with default "gini". For regression "variance", "extratrees", "maxstat" or "beta" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank".
num.random.splits: For "extratrees" splitrule: Number of random splits to consider for each candidate splitting variable.
alpha: For "maxstat" splitrule: Significance threshold to allow splitting.
minprop: For "maxstat" splitrule: Lower quantile of covariate distribution to be considered for splitting.
split.select.weights: Numeric vector with weights between 0 and 1, used to calculate the probability to select variables for splitting. Alternatively, a list of size num.trees, containing split select weight vectors for each tree can be used.
always.split.variables: Character vector with variable names to be always selected in addition to the mtry variables tried for splitting.
scale.permutation.importance: Scale permutation importance by standard error as in (Breiman 2001). Only applicable if permutation variable importance mode selected.
local.importance: Calculate and return local importance values as in (Breiman 2001). Only applicable if importance is set to 'permutation'.
regularization.factor: Regularization factor (gain penalization), either a vector of length p or one value for all variables.
regularization.usedepth: Consider the depth in regularization.
inbag: Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
holdout: Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error.
quantreg: Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction.
oob.error: Compute OOB prediction error. Set to FALSE to save computation time, e.g. for large survival forests.
save.memory: Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing, use only if you encounter memory problems.
verbose: Show computation status and estimated runtime.
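
Additional ranger arguments are given as extra named arguments in the call. The sketch below limits tree depth and samples without replacement; it assumes these two arguments are forwarded unchanged, uses simulated data with illustrative names, and only runs if RFSurrogates is installed.

```r
# Simulated regression data (illustrative only).
set.seed(1)
n <- 200
x <- data.frame(matrix(rnorm(n * 10), nrow = n))
y <- x[[1]] + rnorm(n)

if (requireNamespace("RFSurrogates", quietly = TRUE)) {
  rfs <- RFSurrogates::RandomForestSurrogates(
    x = x,
    y = y,
    type = "regression",
    s = 2,
    num.trees = 50,
    seed = 1,
    # The next two arguments are not formal arguments of
    # RandomForestSurrogates(); they travel through `...` to ranger::ranger.
    max.depth = 5,
    replace = FALSE
  )
}
```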

Value

A RandomForestSurrogates S3 object.

trees: List of all trees with surrogate analysis. (Class: SurrogateTrees, LayerTrees, RangerTrees)
ranger: The ranger::ranger model used to obtain the trees.
s: The number of surrogates investigated.
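
The components listed above can be read with the usual $ accessor. This is a minimal sketch on simulated regression data, assuming x can be a data frame; variable names are illustrative, and the call only runs if RFSurrogates is installed.

```r
# Simulated regression data (illustrative only).
set.seed(7)
n <- 150
x <- data.frame(matrix(rnorm(n * 8), nrow = n))
y <- 2 * x[[1]] - x[[2]] + rnorm(n)

if (requireNamespace("RFSurrogates", quietly = TRUE)) {
  rfs <- RFSurrogates::RandomForestSurrogates(
    x = x, y = y, type = "regression",
    s = 2, num.trees = 50, seed = 7
  )

  class(rfs)          # S3 class of the returned object
  rfs$s               # number of surrogates investigated
  length(rfs$trees)   # trees with surrogate analysis
  class(rfs$ranger)   # the underlying ranger model
}
```

Setting seed, as recommended above, keeps the forest reproducible across runs.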