Create a random forest with surrogates.
Source: R/RandomForestSurrogates.R
The object created by this function serves as the basis of most further analysis,
such as SurrogateMinimalDepth() and MeanAdjustedAgreement().
Usage
RandomForestSurrogates(
x = NULL,
y = NULL,
s.pct = 0.01,
s = ceiling(ncol(x) * ifelse(s.pct > 0 && s.pct <= 1, s.pct, 0.01)),
mtry = c("^3/4", "sqrt", "0.5"),
type = c("regression", "classification", "survival"),
status = NULL,
num.trees = 500,
num.threads = 1,
min.node.size = 1,
permutate = FALSE,
seed = NULL,
preschedule.threads = TRUE,
...
)
Arguments
- x, y
Predictor data and dependent variables.
- s.pct, s
Number of surrogate splits. This can be defined either by setting s.pct to a number between 0 and 1, or by providing an exact value for s.
s.pct: Fraction of variables to use for s. (Default: 0.01)
s: Number of surrogate splits. (Default: number of variables multiplied by s.pct, rounded up. If s.pct is less than or equal to zero, or greater than 1, 0.01 is used instead.)
- mtry
Number of variables to possibly split at in each node. Default is the (rounded down) number of variables to the power of three quarters (Ishwaran, 2011). Alternatively, a single argument function returning an integer, given the number of independent variables.
- type
The type of random forest to create with ranger. One of "regression" (Default), "classification" or "survival".
- status
If type = "survival": Survival forest status variable. Use 1 for event and 0 for censoring. Length must match y.
- num.trees
Number of trees.
- num.threads
Number of threads to parallelize with. (Default: 1)
- min.node.size
Minimal node size to split at. (Default: 1)
- permutate
Enable to permute x for MutualForestImpact() (Default: FALSE).
- seed
RNG seed. It is strongly recommended that you set this value.
- preschedule.threads
(Default: TRUE) Passed as mc.preschedule to parallel::mclapply() in addSurrogates().
- ...
Arguments passed on to ranger::ranger
importance: Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see splitrule) for survival.
min.bucket: Minimal terminal node size. No nodes smaller than this value can occur. Default 3 for survival and 1 for all other tree types.
max.depth: Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree).
replace: Sample with replacement.
sample.fraction: Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement. For classification, this can be a vector of class-specific values.
case.weights: Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
class.weights: Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.
splitrule: Splitting rule. For classification and probability estimation "gini", "extratrees" or "hellinger" with default "gini". For regression "variance", "extratrees", "maxstat" or "beta" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank".
num.random.splits: For "extratrees" splitrule: Number of random splits to consider for each candidate splitting variable.
alpha: For "maxstat" splitrule: Significance threshold to allow splitting.
minprop: For "maxstat" splitrule: Lower quantile of covariate distribution to be considered for splitting.
split.select.weights: Numeric vector with weights between 0 and 1, used to calculate the probability to select variables for splitting. Alternatively, a list of size num.trees, containing split select weight vectors for each tree can be used.
always.split.variables: Character vector with variable names to be always selected in addition to the mtry variables tried for splitting.
scale.permutation.importance: Scale permutation importance by standard error as in (Breiman 2001). Only applicable if permutation variable importance mode selected.
local.importance: Calculate and return local importance values as in (Breiman 2001). Only applicable if importance is set to 'permutation'.
regularization.factor: Regularization factor (gain penalization), either a vector of length p or one value for all variables.
regularization.usedepth: Consider the depth in regularization.
inbag: Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
holdout: Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error.
quantreg: Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction.
oob.error: Compute OOB prediction error. Set to FALSE to save computation time, e.g. for large survival forests.
save.memory: Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing, use only if you encounter memory problems.
verbose: Show computation status and estimated runtime.
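The defaults described for s and mtry can be reproduced in base R. The following is a minimal sketch, not the package's internal code: the helper names default_s, default_mtry and sqrt_mtry are hypothetical, and the expressions are taken from the Usage signature and the mtry description above.

```r
# Hypothetical helpers for illustration only; they mirror the documented
# defaults and are not functions exported by the package.

default_s <- function(n_vars, s.pct = 0.01) {
  # Fall back to 0.01 when s.pct lies outside (0, 1], then round up.
  pct <- ifelse(s.pct > 0 && s.pct <= 1, s.pct, 0.01)
  ceiling(n_vars * pct)
}

default_mtry <- function(n_vars) {
  # Number of variables to the power of three quarters, rounded down.
  floor(n_vars^(3 / 4))
}

# A custom single-argument function returning an integer, as mtry also accepts.
sqrt_mtry <- function(n_vars) as.integer(ceiling(sqrt(n_vars)))

default_s(250)               # 250 * 0.01 = 2.5, rounded up to 3
default_s(250, s.pct = 0.5)  # 125
default_s(250, s.pct = 2)    # invalid s.pct, falls back to 0.01, giving 3
default_mtry(100)            # 100^(3/4) is about 31.6, rounded down to 31
sqrt_mtry(100)               # 10
```

Passing s directly bypasses the s.pct calculation entirely, which may be easier to reason about for data sets with few predictors.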
Value
A RandomForestSurrogates S3 object.
trees: List of all trees with surrogate analysis. (Class: SurrogateTrees, LayerTrees, RangerTrees)
ranger: The ranger::ranger model used to obtain the trees.
s: The number of surrogates investigated.
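As an illustration of the call and the returned elements, here is a hedged sketch of a small regression forest on the built-in mtcars data. It assumes the function is provided by an installed package named RFSurrogates (the package name is not stated on this page), and the choice of data set, num.trees = 50 and seed = 42 are arbitrary. The call is skipped if that package is not available.

```r
# Built-in example data: ten predictors and a numeric response.
x <- mtcars[, -1]
y <- mtcars$mpg

# Assumption: the package is installed under the name "RFSurrogates".
if (requireNamespace("RFSurrogates", quietly = TRUE)) {
  forest <- RFSurrogates::RandomForestSurrogates(
    x = x,
    y = y,
    type = "regression",
    num.trees = 50,  # fewer trees than the default 500 to keep the run short
    s = 1,           # one surrogate split per node
    seed = 42        # fixed seed, as recommended above
  )

  print(forest$s)              # number of surrogates investigated
  print(class(forest$trees))   # trees with surrogate analysis
  print(class(forest$ranger))  # underlying ranger model
}
```

The resulting object is what functions such as SurrogateMinimalDepth() and MeanAdjustedAgreement() build on.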