Trains a new model from molecule SMILES to predict retention times (RT) using the specified method.
Usage
train_frm(
df,
method = "lasso",
verbose = 1,
nfolds = 5,
nw = 1,
degree_polynomial = 1,
interaction_terms = FALSE,
rm_near_zero_var = TRUE,
rm_na = TRUE,
rm_ns = FALSE,
seed = NULL,
do_cv = TRUE
)Arguments
- df
A dataframe with columns "NAME", "RT", "SMILES" and optionally a set of chemical descriptors. If no chemical descriptors are provided, they are calculated using the function
preprocess_data().- method
A string representing the prediction algorithm. Either "lasso", "ridge", "gbtree", "gbtreeDefault" or "gbtreeRP". Method "gbtree" is an alias for "gbtreeDefault".
- verbose
A logical value indicating whether to print progress messages.
- nfolds
An integer representing the number of folds for cross validation.
- nw
An integer representing the number of workers for parallel processing.
- degree_polynomial
An integer representing the degree of the polynomial. Polynomials up to the specified degree are included in the model.
- interaction_terms
A logical value indicating whether to include interaction terms in the model.
- rm_near_zero_var
A logical value indicating whether to remove near zero variance predictors.
- rm_na
A logical value indicating whether to remove NA values before training. Highly recommended to avoid issues during model fitting. Setting this to FALSE with
method = "lasso"will most likely lead to errors.- rm_ns
A logical value indicating whether to remove chemical descriptors that were considered as not suitable for linear regression based on a previous analysis of an independent dataset. Currently not used.
- seed
An integer value to set the seed for random number generation to allow for reproducible results.
- do_cv
A logical value indicating whether to perform cross-validation. If FALSE, the
cvelement in the returned object will be NULL.
Value
A 'FastRet Model', i.e., an object of class frm. Components are:
model: The fitted base model. This can be an object of classglmnet(for Lasso or Ridge regression) orxgb.Booster(for GBTree models).df: The data frame used for training the model. The data frame contains all user-provided columns (including mandatory columns RT, SMILES and NAME) as well the calculated chemical descriptors. (But no interaction terms or polynomial features, as these can be recreated within a few milliseconds).cv: A named list containing the cross validation results, or NULL ifdo_cv = FALSE. When not NULL, elements are:folds: A list of integer vectors specifying the samples in each fold.models: A list of models trained on each fold.stats: A list of vectors with RMSE, Rsquared, MAE, pBelow1Min per fold.preds: Retention time predictions obtained in CV as numeric vector.
seed: The seed used for random number generation.version: The version of the FastRet package used to train the model.args: The value of function arguments besidesdfas named list.