
Trains a new model from molecule SMILES to predict retention times (RT) using the specified method.

Usage

train_frm(
  df = read_rp_xlsx(),
  method = "lasso",
  verbose = 1,
  nfolds = 5,
  nw = 1,
  degree_polynomial = 1,
  interaction_terms = FALSE,
  rm_near_zero_var = TRUE,
  rm_na = TRUE,
  rm_ns = FALSE,
  seed = NULL
)

Arguments

df

A dataframe with columns "NAME", "RT", "SMILES" and optionally a set of chemical descriptors. If no chemical descriptors are provided, they are calculated using the function preprocess_data().
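As a rough illustration of the minimal input shape described above, the following sketch builds a three-row data frame by hand and checks that the required columns are present. The compound selection and the RT values are hypothetical placeholders, not measurements, and the check is not part of FastRet itself.

```r
# Hypothetical input table; RT values are placeholders, not measured data.
df <- data.frame(
  NAME   = c("Caffeine", "Ethanol", "Benzene"),
  RT     = c(1.0, 2.0, 3.0),
  SMILES = c("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "CCO", "c1ccccc1"),
  stringsAsFactors = FALSE
)

# Verify the three required columns exist and have sensible types.
required <- c("NAME", "RT", "SMILES")
missing_cols <- setdiff(required, colnames(df))
stopifnot(
  length(missing_cols) == 0,
  is.numeric(df$RT),
  is.character(df$SMILES)
)
```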

method

A string representing the prediction algorithm. Either "lasso", "ridge" or "gbtree".

verbose

A value indicating whether to print progress messages. The default is 1; the example below passes 0 to suppress messages.

nfolds

An integer representing the number of folds for cross validation.

nw

An integer representing the number of workers for parallel processing.

degree_polynomial

An integer representing the degree of the polynomial. Polynomials up to the specified degree are included in the model.
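To make the idea of polynomial terms concrete, here is a minimal base R sketch that appends element-wise powers of each descriptor column up to a given degree. The helper `add_powers()` and the column names are hypothetical; how FastRet builds these terms internally is not shown on this page, so treat this only as one plausible reading.

```r
# Toy descriptor matrix with hypothetical column names.
X <- cbind(d1 = c(1, 2, 3), d2 = c(0.5, 1.5, 2.5))

# Append element-wise powers 2..degree of every column to the matrix.
add_powers <- function(X, degree) {
  if (degree < 2) return(X)  # degree 1 leaves the predictors unchanged
  powers <- lapply(2:degree, function(p) {
    Xp <- X^p
    colnames(Xp) <- paste0(colnames(X), "^", p)
    Xp
  })
  do.call(cbind, c(list(X), powers))
}

X2 <- add_powers(X, degree = 2)
colnames(X2)
```

With `degree = 1` (the default of `degree_polynomial`), this sketch returns the matrix unchanged, which matches the description that only polynomials up to the specified degree are included.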

interaction_terms

A logical value indicating whether to include interaction terms in the model.
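Interaction terms are commonly understood as pairwise products of predictors. The sketch below shows that interpretation in base R on a toy matrix; the column names are hypothetical, and whether FastRet uses exactly this construction is an assumption.

```r
# Toy descriptor matrix with hypothetical column names.
X <- cbind(d1 = c(1, 2, 3), d2 = c(2, 2, 2), d3 = c(0, 1, 0))

# All unordered pairs of column names; each column of `pairs` is one pair.
pairs <- combn(colnames(X), 2)

# Element-wise product of the two columns in each pair.
inter <- apply(pairs, 2, function(p) X[, p[1]] * X[, p[2]])
colnames(inter) <- apply(pairs, 2, paste, collapse = ":")

# Original predictors followed by their pairwise interactions.
X_int <- cbind(X, inter)
colnames(X_int)
```

Note that the number of pairwise terms grows quadratically with the number of descriptors, which is one reason such terms are usually optional.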

rm_near_zero_var

A logical value indicating whether to remove near zero variance predictors. Setting this to TRUE can cause the CV results to be overoptimistic, as the variance filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection.

rm_na

A logical value indicating whether to remove NA values. Setting this to TRUE can cause the CV results to be overoptimistic, as the NA filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection.

rm_ns

A logical value indicating whether to remove chemical descriptors that were found unsuitable for linear regression in a previous analysis of an independent dataset. See check_lm_suitability() for details on the analysis.

seed

An integer value to set the seed for random number generation to allow for reproducible results.
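The effect of a fixed seed can be illustrated with a small base R sketch that assigns observations to cross-validation folds. The helper `assign_folds()` is hypothetical and only mimics the general pattern of seeding before a random split; it is not FastRet's internal code.

```r
# Hypothetical helper: randomly assign n observations to nfolds folds.
assign_folds <- function(n, nfolds, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # only seed when one is supplied
  sample(rep(seq_len(nfolds), length.out = n))
}

# The same seed yields the same fold assignment on every call.
f1 <- assign_folds(80, 5, seed = 42)
f2 <- assign_folds(80, 5, seed = 42)
identical(f1, f2)  # TRUE
```

With `seed = NULL`, the split depends on the current state of the random number generator, so repeated calls will generally produce different folds.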

Value

A trained FastRet model.

Details

Setting rm_near_zero_var and/or rm_na to TRUE can cause the CV results to be overoptimistic, as the predictor filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection.
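The leakage described above can be seen in a small base R sketch that applies a variance filter inside each training fold instead of on the whole dataset. The plain variance threshold used here is a simplified, hypothetical stand-in for a real near-zero-variance check, and the data are simulated.

```r
set.seed(1)
n <- 20

# Simulated predictors: column b is zero everywhere except one observation.
X <- cbind(a = rnorm(n), b = c(rep(0, n - 1), 1), c = rnorm(n))
folds <- sample(rep(1:4, length.out = n))

# Decide which columns to keep using the training part of each fold only.
keep_by_fold <- lapply(1:4, function(k) {
  train <- X[folds != k, , drop = FALSE]
  # Simplified stand-in filter: keep columns whose training variance
  # exceeds a small (hypothetical) threshold.
  colnames(train)[apply(train, 2, var) > 1e-8]
})

# Filtering on the whole dataset keeps column b, because the single
# non-zero value (possibly a test observation) is visible to the filter.
keep_all <- colnames(X)[apply(X, 2, var) > 1e-8]
```

In the fold whose test part contains the single non-zero value of `b`, the training data alone would drop `b`, whereas the whole-dataset filter keeps it. That difference is the information from the test folds that can make CV estimates overoptimistic.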

Examples

# \donttest{
system.time(m <- train_frm(RP[1:80, ], method = "lasso", nfolds = 2, nw = 1, verbose = 0))
#>    user  system elapsed 
#>   0.451   0.001   0.451 
# For the sake of a short runtime, only the first 80 rows of the RP dataset
# are used in this example. In practice, you should always use the entire
# training dataset for model training.
# }