With a few additions in the past months, this is a good time to catch you all up on the included functionality. For examples, check out the readme on GitHub!
The main purpose of cvms is to allow researchers to quickly compare their models with cross-validation, with a tidy output containing the relevant metrics. Once the best model has been selected, it can be validated (that is, trained on the entire training set and evaluated on a validation set) with the same set of metrics. cross_validate() and validate() are the main tools for this. Besides the set of evaluation metrics, the results and predictions from each cross-validation iteration are included, allowing further analysis.
The main additions and improvements to cross_validate() are, that we can now run repeated cross-validation by simply specifying a list of fold columns (as created by groupdata2::fold), and that the models can be cross-validated in parallel.
Four new functions have recently been added to cvms:
baseline() creates baseline evaluations of the task at hand (either linear regression or binomial logistic regression). Baseline evaluations tell us what performance we should expect if our model was randomly guessing or always predicting the same value. With imbalanced datasets, this information is very important, as we could get seemingly good results with a useless model. Hence, we create a baseline evaluation and compare our models against it during our analysis.
combine_predictors() creates model formulas with all* combinations of a list of fixed effects, including two- and three-way interactions. *When including interaction terms, there are some restrictions, as the model formulas have been precomputed to speed up the process. With these restrictions, we can generate up to 259,358 formulas, which seems like enough for most use cases. While combine_predictors() is useful for trying out a lot of combinations of our fixed effects, we should of course be aware that such combinations may not be theoretically meaningful and could do well due to overfitting. Note that you can add random effects to the formulas, and that multiple versions of the same predictor (e.g. transformed and untransformed) can be added as a sublist, in which case model formulas with both versions will be created, without them ever being in the same formula.
select_metrics() is used to select the evaluation metrics and the model formula columns from the output in cross_validate(). As we have a lot of information in the output, it allows us to focus on the metrics when reporting the models or creating tutorials.
reconstruct_formulas() is used to reconstruct the model formulas from the output of cross_validate(). This is useful, if we have cross-validated a long list of models and want to perform repeated cross-validation on the best models. We simply order the results data frame by our selection criterion (or find the nondominated models with rPref::psel) and reconstruct the model formulas for the best models.
Installation of cvms:
Being the first CRAN release, I hope to get feedback that can improve cvms, be it changes or additions. Feel free to open an issue on GitHub or send me a mail at r-pkgs at ludvigolsen.dk
cvms was designed to work well with groupdata2. See the additions and improvements in version 1.1.0 of groupdata2 released along with cvms here.