Create a random forest with surrogates.
Source: R/RandomForestSurrogates.R
The object created by this function serves as the basis of most further analysis, such as SurrogateMinimalDepth() and MeanAdjustedAgreement().
Usage
RandomForestSurrogates(
x = NULL,
y = NULL,
s.pct = 0.01,
s = ceiling(ncol(x) * ifelse(s.pct > 0 && s.pct <= 1, s.pct, 0.01)),
mtry = c("^3/4", "sqrt", "0.5"),
type = c("regression", "classification", "survival"),
status = NULL,
num.trees = 500,
num.threads = 1,
min.node.size = 1,
permutate = FALSE,
seed = NULL,
preschedule.threads = TRUE,
...
)
Arguments
- x, y
Predictor data and dependent variables.
- s.pct, s
Number of surrogate splits. This can be defined either by setting s.pct to a number between 0 and 1, or by providing an exact value for s.
s.pct: Percentage of variables to use for s, given as a fraction. (Default: 0.01)
s: Number of surrogate splits. (Default: number of variables multiplied by s.pct, rounded up; s.pct defaults to 0.01. If s.pct is less than or equal to zero, or greater than 1, 0.01 is used instead.)
- mtry
Number of variables to possibly split at in each node. Default is the number of variables to the power of three quarters, rounded down (Ishwaran, 2011). Alternatively, a single-argument function that takes the number of independent variables and returns an integer.
- type
The type of random forest to create with ranger. One of "regression" (Default), "classification" or "survival".
- status
If type = "survival": Survival forest status variable. Use 1 for event and 0 for censoring. Length must match y.
- num.trees
Number of trees.
- num.threads
Number of threads to parallelize with. (Default: 1)
- min.node.size
Minimal node size to split at. (Default: 1)
- permutate
Enable to permute x for MutualForestImpact() (Default: FALSE).
- seed
RNG seed. It is strongly recommended that you set this value.
- preschedule.threads
(Default: TRUE) Passed as mc.preschedule to parallel::mclapply() in addSurrogates().
- ...
Arguments passed on to ranger::ranger:
importance
Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see splitrule) for survival.
min.bucket
Minimal terminal node size. No nodes smaller than this value can occur. Default 3 for survival and 1 for all other tree types.
max.depth
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree).
replace
Sample with replacement.
sample.fraction
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement. For classification, this can be a vector of class-specific values.
case.weights
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
class.weights
Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.
splitrule
Splitting rule. For classification and probability estimation "gini", "extratrees" or "hellinger" with default "gini". For regression "variance", "extratrees", "maxstat" or "beta" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank".
num.random.splits
For "extratrees" splitrule: Number of random splits to consider for each candidate splitting variable.
alpha
For "maxstat" splitrule: Significance threshold to allow splitting.
minprop
For "maxstat" splitrule: Lower quantile of covariate distribution to be considered for splitting.
split.select.weights
Numeric vector with weights between 0 and 1, used to calculate the probability to select variables for splitting. Alternatively, a list of size num.trees, containing split select weight vectors for each tree can be used.
always.split.variables
Character vector with variable names to be always selected in addition to the mtry variables tried for splitting.
scale.permutation.importance
Scale permutation importance by standard error as in (Breiman 2001). Only applicable if permutation variable importance mode selected.
local.importance
Calculate and return local importance values as in (Breiman 2001). Only applicable if importance is set to 'permutation'.
regularization.factor
Regularization factor (gain penalization), either a vector of length p or one value for all variables.
regularization.usedepth
Consider the depth in regularization.
inbag
Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
holdout
Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error.
quantreg
Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction.
oob.error
Compute OOB prediction error. Set to FALSE to save computation time, e.g. for large survival forests.
save.memory
Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing, use only if you encounter memory problems.
verbose
Show computation status and estimated runtime.
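The default rules for s and mtry described in the arguments above can be reproduced in base R. The following is a minimal sketch, not the package's internal code: default_s and mtry_three_quarters are hypothetical helper names, the expression for s is copied from the Usage signature, and the mtry helper only illustrates the kind of single-argument function that argument is documented to accept.

```r
# Hypothetical helper mirroring the default expression for `s` in Usage:
# ceiling(ncol(x) * s.pct), falling back to 0.01 when s.pct is outside (0, 1].
default_s <- function(n_vars, s.pct = 0.01) {
  if (!(s.pct > 0 && s.pct <= 1)) s.pct <- 0.01
  ceiling(n_vars * s.pct)
}

# Hypothetical single-argument function of the kind `mtry` accepts:
# number of variables to the power of three quarters, rounded down.
mtry_three_quarters <- function(n_vars) {
  as.integer(floor(n_vars^(3 / 4)))
}

s_default  <- default_s(250)                # 250 * 0.01 = 2.5, rounded up to 3
s_custom   <- default_s(250, s.pct = 0.05)  # 250 * 0.05 = 12.5, rounded up to 13
s_fallback <- default_s(250, s.pct = 1.5)   # invalid s.pct, treated as 0.01
mtry_value <- mtry_three_quarters(100)      # 100^(3/4) is about 31.6, so 31
```

With many predictors, the default s.pct of 0.01 keeps the number of surrogate splits small; passing s directly overrides this calculation.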
Value
A RandomForestSurrogates S3 object.
trees: List of all trees with surrogate analysis. (Class: SurrogateTrees, LayerTrees, RangerTrees)
ranger: The ranger::ranger model used to obtain the trees.
s: The number of surrogates investigated.