Trains a new model from molecule SMILES to predict retention times (RT) using the specified method.
Usage
train_frm(
df = read_rp_xlsx(),
method = "lasso",
verbose = 1,
nfolds = 5,
nw = 1,
degree_polynomial = 1,
interaction_terms = FALSE,
rm_near_zero_var = TRUE,
rm_na = TRUE,
rm_ns = FALSE,
seed = NULL
)
Arguments
- df
A dataframe with columns "NAME", "RT", "SMILES" and optionally a set of chemical descriptors. If no chemical descriptors are provided, they are calculated using the function preprocess_data().
- method
A string representing the prediction algorithm. Either "lasso", "ridge" or "gbtree".
- verbose
A logical value (or 0/1, as in the default verbose = 1) indicating whether to print progress messages.
- nfolds
An integer representing the number of folds for cross validation.
- nw
An integer representing the number of workers for parallel processing.
- degree_polynomial
An integer representing the degree of the polynomial. Polynomials up to the specified degree are included in the model.
- interaction_terms
A logical value indicating whether to include interaction terms in the model.
- rm_near_zero_var
A logical value indicating whether to remove near zero variance predictors. Setting this to TRUE can cause the CV results to be overoptimistic, as the variance filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection.
- rm_na
A logical value indicating whether to remove NA values. Setting this to TRUE can cause the CV results to be overoptimistic, as the NA filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection.
- rm_ns
A logical value indicating whether to remove chemical descriptors that were considered not suitable for linear regression based on a previous analysis of an independent dataset. See check_lm_suitability() for details on the analysis.
- seed
An integer value to set the seed for random number generation to allow for reproducible results.
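To illustrate what degree_polynomial and interaction_terms mean for the predictor matrix, here is a minimal base R sketch. The helper expand_features() is hypothetical and not part of the package; how train_frm() builds these terms internally is not stated here, so treat this as one plausible reading of "polynomials up to the specified degree" plus pairwise interaction terms.

```r
# Hypothetical helper for illustration only; not the package's implementation.
expand_features <- function(X, degree = 1, interactions = FALSE) {
  X <- as.matrix(X)
  out <- X
  # Add the powers x^2, ..., x^degree of every original column.
  if (degree >= 2) {
    for (d in 2:degree) {
      P <- X^d
      colnames(P) <- paste0(colnames(X), "^", d)
      out <- cbind(out, P)
    }
  }
  # Add pairwise products of the original columns.
  if (interactions && ncol(X) >= 2) {
    pairs <- combn(colnames(X), 2)
    inter <- matrix(
      apply(pairs, 2, function(p) X[, p[1]] * X[, p[2]]),
      nrow = nrow(X)
    )
    colnames(inter) <- apply(pairs, 2, paste, collapse = ":")
    out <- cbind(out, inter)
  }
  out
}

# Two toy descriptors for three molecules.
X <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
Z <- expand_features(X, degree = 2, interactions = TRUE)
colnames(Z)
```

With degree = 1 and interactions = FALSE (the defaults shown in Usage), the sketch returns the descriptors unchanged.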
Details
Setting rm_near_zero_var and/or rm_na to TRUE can cause the CV results to be overoptimistic, as the predictor filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection.
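The following base R sketch shows why filtering on the whole dataset can leak information from a test fold. The keep_cols() helper and its 95 % threshold are assumptions made for this illustration; the criterion actually used by train_frm() is not specified here.

```r
# Toy descriptor table: `rare` is non-zero for only the first two molecules.
n <- 20
X <- data.frame(
  mass = seq_len(n),
  rare = c(1, 1, rep(0, n - 2))
)

# Hypothetical near-zero-variance filter: keep a column only if its most
# common value covers less than `max_freq` of the rows.
keep_cols <- function(X, max_freq = 0.95) {
  ok <- vapply(X, function(x) max(table(x)) / length(x) < max_freq, logical(1))
  names(X)[ok]
}

test_idx <- 1:5                          # rows held out as one CV test fold

keep_all   <- keep_cols(X)               # filter sees the test fold
keep_train <- keep_cols(X[-test_idx, ])  # filter sees training rows only

keep_all    # "mass" "rare"
keep_train  # "mass"
```

In this toy case the whole-dataset filter keeps `rare` only because of the two non-zero values that sit in the test fold, which is the kind of leakage described above. Filtering inside each training fold avoids it, at the cost of a possibly different predictor set per fold.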
Examples
# \donttest{
system.time(m <- train_frm(RP[1:80, ], method = "lasso", nfolds = 2, nw = 1, verbose = 0))
#> user system elapsed
#> 0.441 0.004 0.444
# For the sake of a short runtime, only the first 80 rows of the RP dataset
# are used in this example. In practice, you should always use the entire
# training dataset for model training.
# }