Ludvig Renbo Olsen | Portfolio


groupdata2 version 1.1.0 released on CRAN

A few days ago, I released a new version of my R package, groupdata2, on CRAN. groupdata2 contains a set of functions for grouping data, such as creating balanced partitions and folds for cross-validation.

Version 1.1.0 adds the balance() function for using up- and downsampling to balance the categories (e.g. classes) in your dataset. The main difference between existing up-/downsampling tools and balance() is that it has methods for dealing with IDs. If, for instance, our dataset contains a number of participants with multiple measurements (rows) each, we might want to either keep all of a participant's rows or remove all of them, e.g. if we are measuring the effect of time and need all the timesteps for a meaningful analysis. balance() currently has four methods for dealing with IDs. As it is a new function, I’m very open to ideas for improvements and additional functionality. Feel free to open an issue on GitHub or send me a mail at r-pkgs at ludvigolsen.dk.
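A sketch of the intended usage (see ?balance for the authoritative arguments; participant.scores is the dataset from cvms used throughout this page):

# Attach packages
library(groupdata2)
library(cvms) # for the participant.scores dataset

# Downsample to the size of the smallest diagnosis class,
# keeping each participant's rows together
# ("n_ids" is one of the four ID methods; see ?balance)
df_balanced <- balance(participant.scores,
                       size = "min",
                       cat_col = "diagnosis",
                       id_col = "participant",
                       id_method = "n_ids")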

partition() and fold() can now balance the groups (partitions/folds) by a numerical column. In short, the rows are ordered as smallest, largest, second smallest, second largest, etc. (I refer to this as extreme pairing in the documentation), and grouped as pairs. This seems to work pretty well, especially with larger datasets.
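A sketch of the numerical balancing, assuming the num_col argument as described in the docs:

# Create 4 folds balanced by the numerical column "score"
# via extreme pairing
data <- fold(participant.scores, k = 4, num_col = "score")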

fold() can now create multiple unique fold columns at once, e.g. for repeated cross-validation with cvms. As ensuring the uniqueness of the fold columns can take a while, I’ve added the option to run the column comparisons in parallel. You need to register a parallel backend first (e.g. with doParallel::registerDoParallel()).
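A sketch, assuming the num_fold_cols and parallel arguments (see ?fold for the details):

# Register a parallel backend
library(doParallel)
registerDoParallel(4)

# Create 5 unique fold columns at once,
# running the uniqueness comparisons in parallel
data <- fold(participant.scores, k = 4,
             cat_col = "diagnosis",
             id_col = "participant",
             num_fold_cols = 5,
             parallel = TRUE)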

The new helper function differs_from_previous() finds values, or indices of values, that differ from the previous value by some threshold(s). It is related to the “l_starts” method in group().
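A quick sketch (see ?differs_from_previous for the exact arguments):

# Find the indices of values that differ from the
# previous value by at least 2
differs_from_previous(c(1, 1, 2, 5, 5, 9),
                      threshold = 2,
                      return_index = TRUE)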

groupdata2 can be installed with:

CRAN:
install.packages("groupdata2")
Development version:
devtools::install_github("ludvigolsen/groupdata2")

cvms 0.1.0 released on CRAN

After a fairly long life on GitHub, my R package, cvms, for cross-validating linear and logistic regression, is finally on CRAN!

With a few additions in the past months, this is a good time to catch you all up on the included functionality. For examples, check out the readme on GitHub!

The main purpose of cvms is to allow researchers to quickly compare their models with cross-validation, with a tidy output containing the relevant metrics. Once the best model has been selected, it can be validated (that is, trained on the entire training set and evaluated on a validation set) with the same set of metrics. cross_validate() and validate() are the main tools for this. Besides the set of evaluation metrics, the results and predictions from each cross-validation iteration are included, allowing further analysis.
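A minimal sketch of that workflow (the same arguments are used in the walkthroughs further down this page):

# Attach packages
library(cvms)
library(groupdata2)

# Create folds, then cross-validate two competing formulas
data <- fold(participant.scores, k = 4,
             cat_col = "diagnosis", id_col = "participant")
CV <- cross_validate(data,
                     models = c("diagnosis ~ score",
                                "diagnosis ~ age + score"),
                     family = "binomial")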

The main additions and improvements to cross_validate() are that we can now run repeated cross-validation simply by specifying a list of fold columns (as created by groupdata2::fold()), and that the models can be cross-validated in parallel.

Four new functions have recently been added to cvms:

baseline()

baseline() creates baseline evaluations of the task at hand (either linear regression or binomial logistic regression). Baseline evaluations tell us what performance we should expect if our model was randomly guessing or always predicting the same value. With imbalanced datasets, this information is very important, as we could get seemingly good results with a useless model. Hence, we create a baseline evaluation and compare our models against it during our analysis.
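A sketch for the binomial case (n = 100 random evaluations is an arbitrary choice for illustration):

# Evaluate 100 sets of random predictions
# against the dependent column
binomial_baseline <- baseline(test_data = participant.scores,
                              dependent_col = "diagnosis",
                              family = "binomial",
                              n = 100)

# Summarized metrics across the random evaluations
binomial_baseline$summarized_metrics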

combine_predictors()

combine_predictors() creates model formulas with all* combinations of a list of fixed effects, including two- and three-way interactions. *When including interaction terms, there are some restrictions, as the model formulas have been precomputed to speed up the process. With these restrictions, we can generate up to 259,358 formulas, which seems like enough for most use cases. While combine_predictors() is useful for trying out a lot of combinations of our fixed effects, we should of course be aware that such combinations may not be theoretically meaningful and could do well due to overfitting. Note that you can add random effects to the formulas. Multiple versions of the same predictor (e.g. transformed and untransformed) can be added as a sublist, in which case model formulas with both versions will be created, without them ever appearing in the same formula.
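A sketch of the sublist feature (log_score is a hypothetical log-transformed version of score):

# "score" and "log_score" never appear in the same formula
combine_predictors(dependent = "diagnosis",
                   fixed_effects = list("age",
                                        list("score", "log_score")),
                   random_effects = "(1|participant)")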

select_metrics()

select_metrics() is used to select the evaluation metrics and the model formula columns from the output of cross_validate(). As we have a lot of information in the output, it allows us to focus on the metrics when reporting the models or creating tutorials.
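Usage is a one-liner, with CV being the output of cross_validate():

# Keep only the metric and model formula columns
select_metrics(CV)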

reconstruct_formulas()

reconstruct_formulas() is used to reconstruct the model formulas from the output of cross_validate(). This is useful if we have cross-validated a long list of models and want to perform repeated cross-validation on the best models. We simply order the results data frame by our selection criterion (or find the nondominated models with rPref::psel) and reconstruct the model formulas for the best models.
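A sketch for a binomial output (topn is my shorthand for limiting it to the best rows; see ?reconstruct_formulas for the exact argument):

library(dplyr)

# Order by balanced accuracy and reconstruct the
# formulas of the 5 best models
CV %>%
  arrange(desc(`Balanced Accuracy`)) %>%
  reconstruct_formulas(topn = 5)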

Installation

Installation of cvms:

CRAN:
install.packages("cvms")
Development version:
devtools::install_github("ludvigolsen/cvms")

 

As this is the first CRAN release, I hope to get feedback that can improve cvms, be it changes or additions. Feel free to open an issue on GitHub or send me a mail at r-pkgs at ludvigolsen.dk.

cvms was designed to work well with groupdata2. See the additions and improvements in version 1.1.0 of groupdata2 released along with cvms here.

Running cross_validate from cvms in parallel

The cvms package is useful for cross-validating a list of linear and logistic regression model formulas in R. To speed up the process, I’ve added the option to cross-validate the models in parallel. In this post, I will walk you through a simple example and introduce the combine_predictors() function, which generates model formulas by combining a list of fixed effects. We will be using the simple participant.scores dataset from cvms.

First, we will install the newest versions of cvms and groupdata2 from GitHub. You will also need the doParallel package.

# Install packages
devtools::install_github("ludvigolsen/groupdata2")
devtools::install_github("ludvigolsen/cvms")

Then, we attach the packages and set the random seed to 1.

# Attach packages
library(cvms) # cross_validate, combine_predictors
library(groupdata2) # fold
library(doParallel) # registerDoParallel

# Set seed for reproducibility
# Note that R versions < 3.6.0 may give different results
set.seed(1)

Now, we will create the folds for cross-validation. This simply adds a factor called .folds to the dataset, containing fold identifiers (e.g. 1,1,1,2,2,3,3,…). We will also ensure that we have a similar ratio of the two diagnoses in the folds, and that all rows pertaining to a participant are put in the same fold.

# Create folds in the dataset
data <- fold(participant.scores, k = 4,
             cat_col = "diagnosis",
             id_col = "participant")

We will use the combine_predictors() function to generate our model formulas. We supply the list of fixed effects (we will use age and score), and it combines them with and without interactions. Note that with more than 6 fixed effects, it becomes very slow due to the number of possible combinations. To deal with this, it has options to limit the number of fixed effects per formula, along with the maximum size of the included interactions. We will not use those here though.

# Generate model formulas with combine_predictors()
models <- combine_predictors(dependent = "diagnosis",
                             fixed_effects = c("age", "score"))
models

### [1] "diagnosis ~ age" "diagnosis ~ score"
### [3] "diagnosis ~ age * score" "diagnosis ~ age + score"

We want to test if running cross_validate() in parallel is faster than running it sequentially. This would be hard to tell with only 4 simple models, so we repeat the model formulas 100 times each.

# Repeat formulas 100 times
models_repeated <- rep(models, each = 100)

Now we can cross-validate with and without parallelization. We will start without it.

# Cross-validate the model formulas without parallelization
system.time({cv_1 <- cross_validate(data,
                                    models = models_repeated,
                                    family = "binomial")})

###    user  system elapsed
###  26.290   0.194  26.595

This took 26.595 seconds to run.

For the parallelization, we will use the doParallel package. There are other options out there though.

First, we register the number of CPU cores to use. I will use 4 cores.

# Register CPU cores
registerDoParallel(4)

Then, we simply set parallel to TRUE in cross_validate().

# Cross-validate the model formulas with parallelization
system.time({cv_2 <- cross_validate(data,
                                    models = models_repeated,
                                    family = "binomial",
                                    parallel = TRUE)})

###    user  system elapsed
###  39.274   1.845  10.955

This time it took only 10.955 seconds!

As these formulas are very simple, and the dataset is very small, it’s difficult to estimate how much time the parallelization will save in the real world. If we were cross-validating a lot of larger models on a big dataset, it could be a meaningful option.

In this post, you have learned to run cross_validate() in parallel. This functionality can also be found in validate(), and I have added it to the new baseline() function as well, which I will cover in a future post. It creates baseline evaluations, so we have something to compare our awesome models to. Pretty neat!
You have also learned to generate model formulas with combine_predictors().

Repeated cross-validation in cvms and groupdata2

I have spent the last couple of days adding functionality for performing repeated cross-validation to cvms and groupdata2. In this quick post I will show an example.

(Please note: At the moment, you need to use the GitHub version of groupdata2. I hope to update it on CRAN this month.)

In cross-validation, we split our training set into a number (often denoted “k”) of groups called folds. We repeatedly train our machine learning model on k-1 folds and test it on the remaining fold, such that each fold becomes the test set once. Then we average the results and celebrate with food and music.

The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.

The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to cross-validate multiple model formulas at once to quickly compare them and select the best model.

Repeated Cross-validation

In repeated cross-validation we simply repeat this process a couple of times, training the model on more combinations of our training set observations. The more combinations, the less impact one bad split of the data has on our evaluation of the model.

For each repetition, we evaluate our model as we would have in regular cross-validation. Then we average the results from the repetitions and go back to food and music.

groupdata2

As stated, the role of groupdata2 is to create the folds. Normally, it creates one column in the dataset called “.folds”, which contains a fold identifier for each observation (e.g. 1,1,2,2,3,3,1,1,3,3,2,2). In repeated cross-validation, it simply creates multiple such fold columns (“.folds_1”, “.folds_2”, etc.). It also makes sure they are unique, so we actually train on different subsets.

# Install groupdata2 and cvms from github
devtools::install_github("ludvigolsen/groupdata2")
devtools::install_github("ludvigolsen/cvms")

# Attach packages
library(cvms) # cross_validate()
library(groupdata2) # fold()
library(knitr) # kable()
library(dplyr) # %>%

# Set seed for reproducibility
set.seed(7)

# Load data
data <- participant.scores

# Fold data
# Create 3 fold columns
# cat_col is the categorical column to balance between folds
# id_col is the column with IDs. Observations with the same ID will be put in the same fold.
# num_fold_cols determines the number of fold columns, and thereby the number of repetitions.
data <- fold(data, k = 4, cat_col = 'diagnosis', id_col = 'participant', num_fold_cols = 3)

# Show first 10 rows of data
data %>% head(10) %>% kable()

Data Subset with 3 Fold Columns

cvms

In the cross_validate() function, we specify our model formula for a logistic regression that classifies diagnosis. cvms currently supports linear regression and logistic regression, including mixed effects modelling. In the fold_cols argument (previously called folds_col), we specify the names of the fold columns.

CV <- cross_validate(data, "diagnosis ~ score",
                     fold_cols = c('.folds_1', '.folds_2', '.folds_3'),
                     family = 'binomial',
                     REML = FALSE)

# Show results
CV

Output tibble

Due to the number of metrics and useful information, it helps to break up the output into parts:

CV %>% select(1:7) %>% kable()

Evaluation metrics (subset 1)

CV %>% select(8:14) %>% kable()

Evaluation metrics (subset 2)

CV$Predictions[[1]] %>% head() %>% kable()

Nested predictions (subset)

CV$`Confusion Matrix`[[1]] %>% head() %>% kable()

Nested confusion matrices (subset)

CV$Coefficients[[1]] %>% head() %>% kable()

Nested model coefficients (subset)

CV$Results[[1]] %>% select(1:8) %>% kable()

Nested results per fold column (subset)

 

We could have trained multiple models at once by simply adding more model formulas. That would add rows to the output, making it easy to compare the models.

The linear regression version has different evaluation metrics. These are listed in the help page at ?cross_validate.

Conclusion

cvms and groupdata2 now have the functionality for performing repeated cross-validation. We have briefly talked about this technique and gone through a short example. Check out cvms for more 🙂

groupdata2 v1.0.0 release

After having spent a lot of time developing groupdata2 in the spring, adding a lot of new features, I’ve finally found the time to submit it to CRAN.

For those new to groupdata2, it’s a collection of tools for creating groups from your data (hence the name!). From basic greedy groups, to balanced folds for cross-validation, to automatically finding group starts (whenever an element in a vector differs from the previous element), it has a lot to offer. The output is a grouping factor with integers that associate each element (row, if in a data frame) with its group, e.g. 1,1,1,2,2,2,3,3,3. In most use cases, the input and output will be data frames, where the grouping factor is simply added to the output data frame.

A short list of the available use cases:

  • Group by specified number of groups.
  • Group by group size (greedily), i.e. grab n elements from the beginning. Group size can be passed as a whole number or percentage (0<n<1).
    This can be used to change the resolution of temporal data.
  • Group by a list of group sizes. Group sizes can be passed as whole numbers or percentages (0<n<1).
  • Group by list of group starts (values in vector), or automatically find group starts.
  • Create (optionally) balanced folds for cross-validation.
  • Create (optionally) balanced partitions, e.g. for train/test sets.

Functions

The main functions included are group_factor(), group(), splt(), partition(), and fold().

group_factor() is at the heart of it all. It creates the groups and is used by the other functions. It returns a grouping factor with group numbers, i.e. 1s for all elements in group 1, 2s for group 2, etc. So if you ask it to create 2 groups from a vector ('Hans', 'Dorte', 'Mikkel', 'Leif') it will return a factor (1,1,2,2).
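That example as code:

group_factor(c("Hans", "Dorte", "Mikkel", "Leif"), n = 2)

### [1] 1 1 2 2
### Levels: 1 2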

group() takes in either a data frame or vector and returns a data frame with a grouping factor added to it. The data frame is grouped by the grouping factor (using dplyr::group_by), which makes it very easy to use in dplyr pipelines.
If, for instance, you have a column in a data frame with quarterly measurements, and you would like to see the average measurement per year, you can simply create groups with a size of 4, and take the mean of each group, all within a 3-line pipeline.
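A sketch of that pipeline (quarterly_data and measurement are hypothetical names):

library(dplyr)

# Average the quarterly measurements per year:
# greedily grab 4 rows (quarters) per group (year)
quarterly_data %>%
  group(n = 4, method = "greedy") %>%
  summarise(avg_measurement = mean(measurement))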

splt() takes in either a data frame or vector, creates a grouping factor, and splits the given data by this factor using base::split. Often it will be faster to use group() instead of splt(). I also find it easier to work with the output of group().

partition() creates (optionally) balanced partitions (e.g. train/test sets) from given group sizes. It can balance partitions on one categorical variable and/or is able to keep all data points with a shared ID in the same partition.

fold() creates (optionally) balanced folds for cross-validation. It can balance folds on one categorical variable and/or is able to keep all data points with a shared ID in the same fold.
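A sketch of both, using the participant.scores dataset from cvms (the 70/30 split is just for illustration):

# 70% training / 30% test, balanced by diagnosis,
# keeping each participant in a single partition
partitions <- partition(participant.scores, p = 0.7,
                        cat_col = "diagnosis",
                        id_col = "participant")

# 4 folds, balanced the same way
folded <- fold(participant.scores, k = 4,
               cat_col = "diagnosis",
               id_col = "participant")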

Vignettes / Tutorials

Cross-validation with groupdata2
In this vignette, we go through the basics of cross-validation, such as creating balanced train/test sets with partition() and balanced folds with fold(). We also write up a simple cross-validation function and compare multiple linear regression models.

Time series with groupdata2
In this vignette, we divide up a time series into groups (windows) and subgroups using group() with the 'greedy' and 'staircase' methods. We do some basic descriptive stats of each group and use them to reduce the data size.

Automatic groups with groupdata2
In this vignette, we will use the 'l_starts' method with group() to allow transferring of information from one dataset to another. We will use the automatic grouping function that finds group starts all by itself.

For a more extensive description of the features in groupdata2, see Description of groupdata2.

Installation

groupdata2 can be downloaded and installed from CRAN or from GitHub like this:

CRAN version:

install.packages("groupdata2")

Development version:

install.packages("devtools")

devtools::install_github("LudvigOlsen/groupdata2")

 

splitChunk – RStudio addin for splitting code chunks in R Markdown

When working with R Markdown I usually use the key command cmd+alt+i to insert new code chunks, i.e. ```{r}\n\n```. Often I do multiple things in one chunk and then want to split the chunk in two and write some text in-between.

To do this I have created an addin for RStudio that inserts ```\n\n```{r}. I have set this up with the key command cmd+alt+shift+i, as it is kind of a “shifted” version of inserting a new chunk.

The cursor is positioned in-between the chunks allowing me to write an introduction to the following chunk or similar.
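For the curious, here is a minimal sketch of how such an addin can work with the rstudioapi package (the actual splitChunk source may differ; positioning the cursor in-between the chunks would additionally use rstudioapi::setCursorPosition()):

# A hypothetical version of the chunk-splitting addin
split_chunk_addin <- function() {
  # Find the current cursor position
  context <- rstudioapi::getActiveDocumentContext()
  cursor <- context$selection[[1]]$range$start
  # Close the current chunk, leave an empty line for text,
  # and open a new chunk
  rstudioapi::insertText(location = cursor, text = "```\n\n```{r}\n")
}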

Install

The package splitChunk can be installed from GitHub. Paste the following code into the console and run it:

install.packages("devtools")

devtools::install_github("LudvigOlsen/splitChunk")

It is also available through the addinslist package’s Browse RStudio Addins addin.

Add Key Command

After installing it, add a key command (e.g. mac: cmd-alt-shift-i, win: ctrl-alt-shift-i) by going to

  • Tools > Addins > Browse Addins > Keyboard Shortcuts.
  • Find Split Code Chunk and press its field under Shortcut.
  • Press desired key command.
  • Press Apply.
  • Press Execute.
  • Press chosen key command inside a code chunk in R Markdown.

 

Go To: Browse Addins

Add shortcut

Untoggle inline output in RMarkdown in RStudio

A lot of my classmates found themselves frustrated when updating to the newer versions of RStudio, because the output started showing inline, below the code, instead of in the console or Plots window as usual. Personally, I like being able to see the output and the code at the same time, so I found out how to change back. It was pretty easy:

  • Go to RStudio >> Preferences >> R Markdown
  • Now untoggle the "Show output inline for all R Markdown documents" option

RMarkdown: turn off inline output

  • Restart RStudio and everything should work as before.

There are of course advantages to keeping the output inline as well. Being able to scroll through the R Markdown file and see not only the code but also the output is nice. I usually just knit to html for this, but that can take quite some time if it is a long script or if there is a lot of plotting going on, etc. Switching between inline and the old way is also a useful option.