The goal of this function is to train a model that predicts RT_ADJ (retention time measured on a new, adjusted column) from RT (retention time measured on the original column) and to attach this adjustment model to an existing FastRet model.
Usage
adjust_frm(
frm,
new_data,
predictors = 1:6,
nfolds = 5,
verbose = 1,
seed = NULL,
do_cv = TRUE,
adj_type = "lm",
add_cds = NULL,
match_rts = TRUE,
match_keys = NULL
)Arguments
- frm
An object of class
frmas returned bytrain_frm().- new_data
Data frame with required columns "RT", "NAME", "SMILES"; optional "INCHIKEY". "RT" must be the retention time measured on the adjusted column. Each row must match at least one row in
frm$df. The exact matching behavior is described in 'Details'.- predictors
Numeric vector specifying which transformations to include in the model. Available options are: 1=RT, 2=RT^2, 3=RT^3, 4=log(RT), 5=exp(RT), 6=sqrt(RT). Note that predictor 1 (RT) is always included, even if not specified explicitly.
- nfolds
The number of folds for cross validation.
- verbose
Show progress messages?
- seed
An integer value to set the seed for random number generation to allow for reproducible results.
- do_cv
A logical value indicating whether to perform cross-validation. If FALSE, the
cvelement in the returned adjustment object will be NULL.- adj_type
A string representing the adjustment model type. Either "lm", "lasso", "ridge", or "gbtree".
- add_cds
A logical value indicating whether to add chemical descriptors as predictors to new data. Default is TRUE if
adj_typeis "lasso", "ridge" or "gbtree" and FALSE ifadj_typeis "lm".- match_rts
Logical indicating how to obtain the RT feature used during training of the adjustment model. If TRUE (default), RT is obtained by matching rows in
new_datato rows infrm$dfbased onmatch_keys. If FALSE, RT is obtained by applying the base model tonew_data.- match_keys
Which columns to use for matching
new_datatofrm$dfifmatch_rtsis TRUE. Can be any combination of "INCHIKEY", "SMILES" and "NAME". If left at NULL, matching is done viac("SMILES", "INCHIKEY")if INCHIKEYs are available, else viac("SMILES", "NAME"). See 'Details' for more information on the matching procedure.
Value
An object of class frm, as returned by train_frm(), but with an
additional element adj containing the adjustment model. Components of adj
are:
model: The fitted adjustment model. Class depends onadj_typeand is one oflm,glmnet, orxgb.Booster.df: The data frame used for training the adjustment model. Including columns "NAME", "SMILES", "RT", "RT_ADJ" and optionally "INCHIKEY", as well as any additional predictors specified via thepredictorsargument.cv: A named list containing the cross validation results (see 'Details'), or NULL ifdo_cv = FALSE. When not NULL, elements are:folds: A list of integer vectors specifying the samples in each fold.models: A list of adjustment models trained on each fold.stats: A list of vectors with RMSE, Rsquared, MAE, pBelow1Min per fold. Added with v1.3.0.preds: Retention time predictions obtained during CV by applying the adjustment model to the hold-out data.preds_adjonly: Removed (i.e. NULL) since v1.3.0.
args: Function arguments used for adjustment (excludingfrm,new_dataandverbose). Added with v1.3.0.version: The version of the FastRet package used to train the adjustment model. Added with v1.3.0.
Details
Matching is done via c("SMILES", "INCHIKEY") if both datasets have non-missing
INCHIKEYs for all rows; otherwise via c("SMILES", "NAME"). This can be
overridden by supplying match_keys. If match_rts = FALSE, no matching is
performed and retention times are generated by calling predict(frm, new_data, adjust = FALSE). If multiple rows in frm$df match the same row in
new_data, their RT values are averaged first, and this average is used for
training the adjustment model.
Example: if frm$df equals data.frame OLD shown below and new_data equals
data.frame NEW, then the resulting, paired data.frame will look like PAIRED.
OLD <- data.frame(
NAME = c("A", "B", "B", "C" ),
SMILES = c("C", "CC", "CC", "CCC"),
RT = c(5.0, 8.0, 8.2, 9.0 )
)
NEW <- data.frame(
NAME = c("A", "B", "B", "B"),
SMILES = c("C", "CC", "CC", "CC"),
RT = c(2.5, 5.5, 5.7, 5.6)
)
PAIRED <- data.frame(
NAME = c("A", "B", "B", "B"),
SMILES = c("C", "CC", "CC", "CC"),
RT = c(5.0, 8.1, 8.1, 8.1), # Average of OLD$RT[2:3]
RT_ADJ = c(2.5, 5.5, 5.7, 5.6) # Taken from NEW
)If do_cv is TRUE, the adjustment procedure is evaluated in
cross-validation. However, care must be taken when interpreting the CV
results, as the model performance depends on both the adjustment layer and
the original model, which was trained on the full base dataset. Therefore,
the observed CV metrics should be read as "expected performance when
predicting RTs for molecules that were part of the base-model training but
not part of the adjustment set" instead of "expected performance when
predicting RTs for completely new molecules".
Examples
frm <- read_rp_lasso_model_rds()
new_data <- read_rpadj_xlsx()
frm_adj <- adjust_frm(frm, new_data, verbose = 0)