Metabodecon Models — mdm • metabodecon

WARNING: These functions are experimental and must not be used in production. Their API is very likely to change in non-backwards-compatible ways over the next few weeks.

Utilities for fitting, tuning and benchmarking 'metabodecon models' (mdm).

A mdm is a essentially a glmnet lasso model, fitted on a feature matrix X, obtained by deconvoluting and aligning spectra and snapping their peaks to a shared reference spectrum. Lambda is chosen via cross-validation on X. Deconvolution parameters (npmax or nfit/smit/smws/delta), alignment parameter maxShift, and peak-combining parameters (maxCombine, combineMethod) are tunable hyperparameters.

When npmax > 0, deconvolution runs an internal grid search over smoothing and fitting parameters and keeps at most npmax peaks. The explicit nfit/smit/smws/delta values are ignored in that case. When npmax == 0, the explicit parameters are used directly.

fit_mdm() deconvolutes spectra, aligns detected peaks, combines them via get_si_mat() and fits one lasso model via glmnet::cv.glmnet(). Lambda is selected internally by cross-validation but no performance metrics are reported. Use cv_mdm() to tune preprocessing parameters and benchmark_mdm() for unbiased performance estimates.

cv_mdm() evaluates a grid of preprocessing parameter combinations. For each grid row it builds a feature matrix, runs glmnet::cv.glmnet() with a fixed fold assignment, and records the held-out accuracy and AUC at lambda.min. Returns the model with the best AUC and the full performance grid.

benchmark_mdm() wraps cv_mdm() in an outer cross-validation loop to estimate end-to-end predictive performance. It returns the per-fold models and held-out predictions.

Usage

fit_mdm(
  spectra,
  y,
  sfr = NULL,
  use_rust = TRUE,
  nworkers = 1,
  verbosity = 2,
  seed = 1,
  cadir = decon_cachedir(),
  check = TRUE,
  npmax = 1000,
  nfit = 5,
  smit = 2,
  smws = 5,
  delta = 6.4,
  maxShift = 200,
  maxCombine = 50,
  combineMethod = 2,
  nfolds = 10
)

cv_mdm(
  spectra,
  y,
  sfr = NULL,
  use_rust = TRUE,
  nworkers = 1,
  verbosity = 2,
  seed = 1,
  cadir = decon_cachedir(),
  check = TRUE,
  combineMethod = 2,
  pgrid = get_pgrid("dynamic2"),
  ignore = NULL,
  warm_cache = TRUE,
  nfolds = 10
)

benchmark_mdm(
  spectra,
  y,
  sfr = NULL,
  use_rust = TRUE,
  verbosity = 2,
  seed = 1,
  cadir = decon_cachedir(),
  nwo = 1,
  nwi = half_cores(),
  combineMethod = 2,
  pgrid = get_pgrid(),
  nfo = 5,
  nfl = 10
)

get_pgrid(conf = "dynamic2")

Arguments

spectra: List-like spectra object with cs and si vectors.
y: Factor vector with class labels for each spectrum.
sfr: Signal free region. See deconvolute() for details.
use_rust: Logical. Whether to use the Rust backend.
nworkers: Number of workers used by fit_mdm() and cv_mdm() for deconvolution and alignment.
verbosity: Integer. Verbosity level; each nested call decrements by 1. Messages print when verbosity >= 1.
seed: Random seed used for fold assignment in benchmark_mdm() and for inner cross-validation splits in cv_mdm() and fit_mdm().
cadir: Directory used to cache deconvolution results across grid points and processes. Defaults to decon_cachedir().
check: Logical. Whether to validate inputs at function entry.
npmax: Maximum number of peaks to retain. When npmax > 0, deconvolution runs an internal grid search and the explicit nfit/smit/smws/delta are ignored. Set to 0 to use explicit deconvolution parameters instead.
nfit: Number of Lorentz-curve fitting iterations (used when npmax == 0).
smit: Number of smoothing iterations (used when npmax == 0).
smws: Smoothing window size (used when npmax == 0).
delta: Peak-filter threshold (used when npmax == 0).
maxShift: Maximum alignment shift.
maxCombine: Maximum peak-combining distance in datapoints, passed to get_si_mat(). Interpretation depends on combineMethod.
combineMethod: Peak-combining backend passed to get_si_mat(). 1: column merging via combine_peaks(). 2 (default): nearest-neighbour snapping to reference peaks.
nfolds: Number of folds for the cv.glmnet() call in fit_mdm() and cv_mdm(). Default 10.
pgrid: Data frame of preprocessing parameter combinations as returned by get_pgrid().
ignore: Optional integer vector of sample indices to exclude.
warm_cache: Logical. Whether to pre-populate the disk cache before starting the grid search.
nwo: Number of workers for the outer cross-validation in benchmark_mdm(). Each outer worker holds a full copy of spectra and y.
nwi: Number of workers used inside each outer fold for cv_mdm() and predict.mdm().
nfo: Number of outer folds in benchmark_mdm().
nfl: Number of folds for the cv.glmnet() call inside benchmark_mdm(). Passed as nfolds to fit_mdm().
conf: Character string selecting a predefined parameter grid configuration.

Value

fit_mdm() returns an object of class mdm with elements model (a glmnet::cv.glmnet() object), ref (the reference alignment spectrum) and meta (list of preprocessing parameters).

cv_mdm() returns an mdm object with an additional element pgrid containing the performance grid.

benchmark_mdm() returns a list with elements:

models: List of mdm objects, one per outer fold.
predictions: Data frame with columns fold, true, link, prob, pred.

Details

Disk caching across processes

cv_mdm() and benchmark_mdm() can share deconvolution results through directory cadir. This on-disk cache lets parallel workers reuse expensive preprocessing results while keeping RAM use manageable. Disk caching can be disabled by setting cadir = NULL, but this will increase runtime drastically (from a few minutes/hours to several days).

RAM caching within processes

fit_mdm() caches the most recent deconvolution and alignment results in RAM when the input spectra carries a "hash" attribute. This avoids redundant preprocessing when cv_mdm() evaluates several settings on the same spectra.

Memory usage by parallel workers

Each outer worker in benchmark_mdm() keeps its own copy of spectra and y, so memory use grows roughly with nwo. Large nwo values can therefore require substantial RAM for large input datasets.

Inner workers (nwi) are used inside each outer worker for deconvolution, alignment and prediction on held-out spectra. They process only the spectra needed for the current task, so their memory impact is much smaller. In RAM-constrained settings, use nwo = 1 and increase nwi instead.

The total amount of processes spawned is nwo * nwi.

Examples

if (FALSE) { # \dontrun{



     # -~-~-~ Inputs -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-

     aki <- read_aki_data()
     spectra <- aki$spectra
     attr(spectra, "hash") <- rlang::hash(spectra)
     y <- factor(aki$meta$type, levels = c("Control", "AKI"))
     names(y) <- rownames(aki$meta)
     sfr <- c(11, -2)
     cadir <- cachedir("deconvs", persistent = TRUE)



     # -~-~-~-~-~-~-~-~-~ Single -~-~-~-~-~-~-~-~-~

     # Best "simple" model from aki.R (0.797 [0.75 - 0.83])
     mdm <- fit_mdm(
         spectra, y, sfr=NULL,
         nfit=5, smit=3, smws=3, delta=3, npmax=0,
         maxShift=128, maxCombine=256, combineMethod=1,
         nworkers=half_cores(), cadir=cadir
     )

     mdm <- fit_mdm(
         spectra, y, sfr = NULL,
         nfit=5, smit=0, smws=0, delta=0, npmax=1000,
         maxShift=64, maxCombine=16, combineMethod=1,
         nworkers=half_cores(), cadir=cadir
     )



     # -~-~-~-~-~-~-~-~-~ Search -~-~-~-~-~-~-~-~-~

     mdm_grid_stat1 <- cv_mdm(
         spectra, y, pgrid=get_pgrid("static1"),
         nworkers=half_cores(), use_rust=TRUE, cadir=cadir
     )
     saveRDS(mdm_grid_stat1, "tmp/mdm_grid_stat1.rds")


     mdm_grid_stat2 <- cv_mdm(
         spectra, y, pgrid=get_pgrid("static2"),
         nworkers=half_cores(), use_rust=TRUE, cadir=cadir
     )
     saveRDS(mdm_grid_stat2, "tmp/mdm_grid_stat2.rds")
     #
     # Best static2: acc=0.82, auc=0.89
     # smit=2, smws=3, delta=6, nfit=7, npmax=0, maxShift=100, maxCombine=30


     mdm_grid_stat3 <- cv_mdm(
         spectra, y, pgrid=get_pgrid("static3"),
         nworkers=half_cores(), use_rust=TRUE, cadir=cadir
     )
     saveRDS(mdm_grid_stat3, "tmp/mdm_grid_stat3.rds")
     #
     # Best static3: acc=79.34% auc=0.8589
     # smit=2, smws=7, delta=8, nfit=10, npmax=0, maxShift=100, maxCombine=30


     mdm_grid_dyn2 <- cv_mdm(
         spectra, y, pgrid=get_pgrid("dynamic2"),
         nworkers=half_cores(), use_rust=TRUE, cadir=cadir
     )

     # -~-~-~ Full Benchmark -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
     bm <- benchmark_mdm(spectra, y, sfr, cadir=cadir)

     # -~-~-~ Interactive Development -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     stub(fit_mdm, spectra=spectra, y=y, sfr=sfr)
     stub(cv_mdm, spectra=spectra, y=y, sfr=sfr)
     stub(benchmark_mdm, spectra=spectra, y=y, sfr=sfr)
} # }