Variable selection with mutual impurity reduction (MIR).
Source: R/variable_selection_mir.R
var.select.mir.Rd
This function performs MIR. It uses ranger to grow the random forests and to compute the actual impurity reduction (AIR), and a modified version of rpart to find surrogate variables.
Usage
var.select.mir(
  x = NULL,
  y = NULL,
  num.trees = 500,
  type = "regression",
  s = NULL,
  mtry = NULL,
  min.node.size = 1,
  num.threads = NULL,
  status = NULL,
  save.ranger = FALSE,
  save.memory = FALSE,
  num.permutations = 100,
  p.t.sel = 0.01,
  p.t.rel = 0.01,
  select.var = TRUE,
  select.rel = FALSE,
  case.weights = NULL,
  corr.rel = TRUE,
  t = 5,
  method.rel = "permutation",
  method.sel = "janitza",
  save.rel = TRUE
)
Arguments
- x
data.frame of predictor variables with variables in columns and samples in rows (Note: missing values are not allowed)
- y
vector with values of phenotype variable (Note: will be converted to factor if classification mode is used). For survival forests this is the time variable.
- num.trees
number of trees. Default is 500.
- type
mode of prediction ("regression", "classification" or "survival"). Default is "regression".
- s
predefined number of surrogate splits (the actual number of surrogate splits may differ in individual nodes). Default is 1 percent of the number of variables.
- mtry
number of variables to possibly split at in each node. Default is the number of variables to the power of 3/4 ("^3/4"), as recommended by Ishwaran (2011). Also possible are "sqrt" and "0.5" to use the square root or half of the number of variables.
- min.node.size
minimal node size. Default is 1.
- num.threads
number of threads used for parallel execution. Default is number of CPUs available.
- status
status variable, only applicable to survival data. Use 1 for event and 0 for censoring.
- save.ranger
set TRUE if ranger object should be saved. Default is that ranger object is not saved (FALSE).
- save.memory
Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing, use only if you encounter memory problems. (This parameter is passed on to ranger.)
- num.permutations
number of permutations to determine p-values. Default is 100. (The relations are determined once based on the permuted x data, and the utilized AIR values are permuted again for each permutation.)
- p.t.sel
p-value threshold for selection of important variables. Default is 0.01.
- p.t.rel
p-value threshold for selection of related variables. Default is 0.01.
- select.var
set FALSE if only importance should be calculated and no variables should be selected. Default is TRUE.
- select.rel
set FALSE if only relations should be calculated and no variables should be selected. Default is FALSE.
- case.weights
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
- corr.rel
set FALSE if non-corrected variable relations should be used for calculation of MIR. In this case, the method "janitza" should not be used for selection of important variables.
- t
variable to calculate threshold for non-corrected relation analysis. Default is 5.
- method.rel
Method to compute p-values for selection of related variables with var.relations.corr. Use "janitza" for the method by Janitza et al. (2016) or "permutation" to utilize permuted variables.
- method.sel
Method to compute p-values for selection of important variables. Use "janitza" for the method by Janitza et al. (2016) (can only be used when corrected variable relations are utilized) or "permutation" to utilize permuted variables.
- save.rel
set FALSE if relation information should not be saved. Default is TRUE.
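The defaults described for `mtry` and `s` can be illustrated with a little arithmetic. The sketch below is not the package's code: `resolve_mtry` and `default_s` are hypothetical helpers, and the rounding (floor for `mtry`, ceiling for `s`) is an assumption, since the descriptions above do not state how fractional values are handled.

```r
# Hypothetical helper, not part of the package: resolves mtry from the
# documented options. Flooring the result is an assumption.
resolve_mtry <- function(n_var, mtry = NULL) {
  if (is.null(mtry) || identical(mtry, "^3/4")) {
    return(floor(n_var^(3 / 4)))
  }
  if (identical(mtry, "sqrt")) {
    return(floor(sqrt(n_var)))
  }
  if (identical(mtry, "0.5")) {
    return(floor(0.5 * n_var))
  }
  as.integer(mtry) # a plain number is used as given
}

# Hypothetical helper: 1 percent of the variables, rounded up here
# (assumption) so that at least one surrogate split is requested.
default_s <- function(n_var) ceiling(n_var / 100)

n_var <- 200
resolve_mtry(n_var)         # default, same as "^3/4"
resolve_mtry(n_var, "sqrt")
resolve_mtry(n_var, "0.5")
default_s(n_var)
```

For 200 predictors this gives 53, 14 and 100 candidate variables per node, and 2 surrogate splits.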
Value
List with the following components:
info
: list with results containing:
- MIR: the calculated variable importance for each variable based on mutual impurity reduction.
- pvalue: the obtained p-values for each variable.
- selected: whether each variable has been selected (1) or not (0).
- relations: a list containing the results of variable relation analysis.
- parameters: a list that contains the parameters s, type, mtry, p.t.sel, p.t.rel and method.sel that were used.
var
: vector of selected variables.
ranger
: ranger object.
References
Nembrini, S. et al. (2018) The revival of the Gini importance? Bioinformatics, 34, 3711–3718. https://academic.oup.com/bioinformatics/article/34/21/3711/4994791
Seifert, S. et al. (2019) Surrogate minimal depth as an importance measure for variables in random forests. Bioinformatics, 35, 3663–3671. https://academic.oup.com/bioinformatics/article/35/19/3663/5368013
Examples
# \donttest{
data("SMD_example_data")
# select variables (usually more trees are needed)
set.seed(42)
res <- var.select.mir(
  x = SMD_example_data[, 2:ncol(SMD_example_data)],
  y = SMD_example_data[, 1], s = 10, num.trees = 10, num.threads = 1
)
#> Warning: Relations for 10 original variables were not calculated because they were never used as a primary split.
#> Affected relations are set to 0.
#> Warning: Relations for 10 permuted variables were not calculated because they were not used as a primary split.
#> Affected relations are set to 0.
#> Warning: Only few negative importance values found for selection of important variables, inaccurate p-values. Consider the 'permutation' approach.
res$var
#> [1] "X2" "X3" "X4" "cp1_3" "cp1_5" "cp2_1" "cgn_66"
# }
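The last warning in the output above concerns the "janitza" approach, which derives the null distribution from the non-positive importance values. The sketch below illustrates that idea following Janitza et al. (2016); it is not the package's code, the importance values are made up, and `janitza_pvalues` is a hypothetical helper whose exact p-value formula may differ from the one used internally.

```r
# Hypothetical importance values; not output of var.select.mir.
imp <- c(a = 2.1, b = 0.9, c = 0.05, d = -0.04, e = -0.10, f = 0, g = -0.02)

janitza_pvalues <- function(imp) {
  neg <- imp[imp < 0]
  zero <- imp[imp == 0]
  # Null distribution: negative values, their mirror images, and zeros.
  null_imp <- c(neg, -neg, zero)
  # Share of null values that are at least as large as each observed value.
  vapply(imp, function(v) mean(null_imp >= v), numeric(1))
}

p <- janitza_pvalues(imp)
p
names(p)[p < 0.05]
```

With only three negative values the null distribution has seven points, so the smallest non-zero p-value is 1/7. This coarseness is why few negative importance values lead to inaccurate p-values, and why growing more trees or switching to `method.sel = "permutation"` may help.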