WARNING: These functions are experimental and must not be used in production. Their API is very likely to change in non-backwards-compatible ways over the next few weeks.
Utilities for fitting, tuning and benchmarking 'metabodecon models' (mdm).
A mdm is a essentially a glmnet lasso model, fitted on a feature matrix X,
obtained by deconvoluting and aligning spectra and snapping their peaks to a
shared reference spectrum. Lambda is chosen via cross-validation on X.
Deconvolution parameters (npmax or nfit/smit/smws/delta),
alignment parameter maxShift, and peak-combining parameters
(maxCombine, combineMethod) are tunable hyperparameters.
When npmax > 0, deconvolution runs an internal grid search over smoothing
and fitting parameters and keeps at most npmax peaks. The explicit
nfit/smit/smws/delta values are ignored in that case. When
npmax == 0, the explicit parameters are used directly.
fit_mdm() deconvolutes spectra, aligns detected peaks,
combines them via get_si_mat() and fits one lasso model via
glmnet::cv.glmnet(). Lambda is selected internally by cross-validation
but no performance metrics are reported. Use cv_mdm() to
tune preprocessing parameters and benchmark_mdm() for
unbiased performance estimates.
cv_mdm() evaluates a grid of preprocessing parameter
combinations. For each grid row it builds a feature matrix, runs
glmnet::cv.glmnet() with a fixed fold assignment, and records the
held-out accuracy and AUC at lambda.min. Returns the model with the best
AUC and the full performance grid.
benchmark_mdm() wraps cv_mdm() in an
outer cross-validation loop to estimate end-to-end predictive performance.
It returns the per-fold models and held-out predictions.
Usage
fit_mdm(
spectra,
y,
sfr = NULL,
use_rust = TRUE,
nworkers = 1,
verbosity = 2,
seed = 1,
cadir = decon_cachedir(),
check = TRUE,
npmax = 1000,
nfit = 5,
smit = 2,
smws = 5,
delta = 6.4,
maxShift = 200,
maxCombine = 50,
combineMethod = 2,
nfolds = 10
)
cv_mdm(
spectra,
y,
sfr = NULL,
use_rust = TRUE,
nworkers = 1,
verbosity = 2,
seed = 1,
cadir = decon_cachedir(),
check = TRUE,
combineMethod = 2,
pgrid = get_pgrid("dynamic2"),
ignore = NULL,
warm_cache = TRUE,
nfolds = 10
)
benchmark_mdm(
spectra,
y,
sfr = NULL,
use_rust = TRUE,
verbosity = 2,
seed = 1,
cadir = decon_cachedir(),
nwo = 1,
nwi = half_cores(),
combineMethod = 2,
pgrid = get_pgrid(),
nfo = 5,
nfl = 10
)
get_pgrid(conf = "dynamic2")Arguments
- spectra
List-like spectra object with
csandsivectors.- y
Factor vector with class labels for each spectrum.
- sfr
Signal free region. See
deconvolute()for details.- use_rust
Logical. Whether to use the Rust backend.
- nworkers
Number of workers used by
fit_mdm()andcv_mdm()for deconvolution and alignment.- verbosity
Integer. Verbosity level; each nested call decrements by 1. Messages print when
verbosity >= 1.- seed
Random seed used for fold assignment in
benchmark_mdm()and for inner cross-validation splits incv_mdm()andfit_mdm().- cadir
Directory used to cache deconvolution results across grid points and processes. Defaults to
decon_cachedir().- check
Logical. Whether to validate inputs at function entry.
- npmax
Maximum number of peaks to retain. When
npmax > 0, deconvolution runs an internal grid search and the explicitnfit/smit/smws/deltaare ignored. Set to 0 to use explicit deconvolution parameters instead.- nfit
Number of Lorentz-curve fitting iterations (used when
npmax == 0).- smit
Number of smoothing iterations (used when
npmax == 0).- smws
Smoothing window size (used when
npmax == 0).- delta
Peak-filter threshold (used when
npmax == 0).- maxShift
Maximum alignment shift.
- maxCombine
Maximum peak-combining distance in datapoints, passed to
get_si_mat(). Interpretation depends oncombineMethod.- combineMethod
Peak-combining backend passed to
get_si_mat().1: column merging viacombine_peaks().2(default): nearest-neighbour snapping to reference peaks.- nfolds
Number of folds for the
cv.glmnet()call infit_mdm()andcv_mdm(). Default 10.- pgrid
Data frame of preprocessing parameter combinations as returned by
get_pgrid().- ignore
Optional integer vector of sample indices to exclude.
- warm_cache
Logical. Whether to pre-populate the disk cache before starting the grid search.
- nwo
Number of workers for the outer cross-validation in
benchmark_mdm(). Each outer worker holds a full copy ofspectraandy.- nwi
Number of workers used inside each outer fold for
cv_mdm()andpredict.mdm().- nfo
Number of outer folds in
benchmark_mdm().- nfl
Number of folds for the
cv.glmnet()call insidebenchmark_mdm(). Passed asnfoldstofit_mdm().- conf
Character string selecting a predefined parameter grid configuration.
Value
fit_mdm() returns an object of class mdm with elements
model (a glmnet::cv.glmnet() object), ref (the reference alignment
spectrum) and meta (list of preprocessing parameters).
cv_mdm() returns an mdm object with an additional element
pgrid containing the performance grid.
benchmark_mdm() returns a list with elements:
models: List ofmdmobjects, one per outer fold.predictions: Data frame with columnsfold,true,link,prob,pred.
Details
Disk caching across processes
cv_mdm() and benchmark_mdm() can share
deconvolution results through directory cadir. This on-disk cache lets
parallel workers reuse expensive preprocessing results while keeping RAM use
manageable. Disk caching can be disabled by setting cadir = NULL, but this will increase runtime drastically (from a few minutes/hours
to several days).
RAM caching within processes
fit_mdm() caches the most recent deconvolution and alignment
results in RAM when the input spectra carries a "hash" attribute. This
avoids redundant preprocessing when cv_mdm() evaluates
several settings on the same spectra.
Memory usage by parallel workers
Each outer worker in benchmark_mdm() keeps its own copy of
spectra and y, so memory use grows roughly with nwo. Large nwo values
can therefore require substantial RAM for large input datasets.
Inner workers (nwi) are used inside each outer worker for deconvolution,
alignment and prediction on held-out spectra. They process only the spectra
needed for the current task, so their memory impact is much smaller. In
RAM-constrained settings, use nwo = 1 and increase nwi instead.
The total amount of processes spawned is nwo * nwi.
Examples
if (FALSE) { # \dontrun{
# -~-~-~ Inputs -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
aki <- read_aki_data()
spectra <- aki$spectra
attr(spectra, "hash") <- rlang::hash(spectra)
y <- factor(aki$meta$type, levels = c("Control", "AKI"))
names(y) <- rownames(aki$meta)
sfr <- c(11, -2)
cadir <- cachedir("deconvs", persistent = TRUE)
# -~-~-~-~-~-~-~-~-~ Single -~-~-~-~-~-~-~-~-~
# Best "simple" model from aki.R (0.797 [0.75 - 0.83])
mdm <- fit_mdm(
spectra, y, sfr=NULL,
nfit=5, smit=3, smws=3, delta=3, npmax=0,
maxShift=128, maxCombine=256, combineMethod=1,
nworkers=half_cores(), cadir=cadir
)
mdm <- fit_mdm(
spectra, y, sfr = NULL,
nfit=5, smit=0, smws=0, delta=0, npmax=1000,
maxShift=64, maxCombine=16, combineMethod=1,
nworkers=half_cores(), cadir=cadir
)
# -~-~-~-~-~-~-~-~-~ Search -~-~-~-~-~-~-~-~-~
mdm_grid_stat1 <- cv_mdm(
spectra, y, pgrid=get_pgrid("static1"),
nworkers=half_cores(), use_rust=TRUE, cadir=cadir
)
saveRDS(mdm_grid_stat1, "tmp/mdm_grid_stat1.rds")
mdm_grid_stat2 <- cv_mdm(
spectra, y, pgrid=get_pgrid("static2"),
nworkers=half_cores(), use_rust=TRUE, cadir=cadir
)
saveRDS(mdm_grid_stat2, "tmp/mdm_grid_stat2.rds")
#
# Best static2: acc=0.82, auc=0.89
# smit=2, smws=3, delta=6, nfit=7, npmax=0, maxShift=100, maxCombine=30
mdm_grid_stat3 <- cv_mdm(
spectra, y, pgrid=get_pgrid("static3"),
nworkers=half_cores(), use_rust=TRUE, cadir=cadir
)
saveRDS(mdm_grid_stat3, "tmp/mdm_grid_stat3.rds")
#
# Best static3: acc=79.34% auc=0.8589
# smit=2, smws=7, delta=8, nfit=10, npmax=0, maxShift=100, maxCombine=30
mdm_grid_dyn2 <- cv_mdm(
spectra, y, pgrid=get_pgrid("dynamic2"),
nworkers=half_cores(), use_rust=TRUE, cadir=cadir
)
# -~-~-~ Full Benchmark -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
bm <- benchmark_mdm(spectra, y, sfr, cadir=cadir)
# -~-~-~ Interactive Development -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
stub(fit_mdm, spectra=spectra, y=y, sfr=sfr)
stub(cv_mdm, spectra=spectra, y=y, sfr=sfr)
stub(benchmark_mdm, spectra=spectra, y=y, sfr=sfr)
} # }