groupdata2 is an R package that makes it easy to group and split data, and it offers a wide range of grouping methods for different contexts. It contains functions for creating balanced folds for cross-validation and balanced partitions for train/test sets.
A short list of the available use cases:
- Group by specified number of groups.
- Group by group size (greedily), i.e. grab n elements from the beginning. Group size can be passed as a whole number or a percentage (0<n<1).
This can be used to change the resolution of temporal data.
- Group by a list of group sizes. Group sizes can be passed as whole numbers or percentages (0<n<1).
- Group by list of group starts (values in vector), or automatically find group starts.
- Create (optionally) balanced folds for cross-validation.
- Create (optionally) balanced partitions, e.g. for train/test sets.
- Balance a dataset with up- and downsampling, with different methods for handling IDs.
The main functions included are group_factor(), group(), splt(), partition(), fold(), and balance().
group_factor() is at the heart of it all. It creates the groups and is used by the other functions. It returns a grouping factor with group numbers, i.e. 1s for all elements in group 1, 2s for group 2, etc. So if you ask it to create 2 groups from the vector ("Hans", "Dorte", "Mikkel", "Leif"), it will return the factor (1, 1, 2, 2).
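The example from the paragraph above can be sketched as follows. This is a minimal illustration, assuming the default grouping method is used; the variable names are made up for the example.

```r
library(groupdata2)

# The four names from the example above
names_vec <- c("Hans", "Dorte", "Mikkel", "Leif")

# Ask for 2 groups; the result is a factor with one group number per element
groups <- group_factor(names_vec, n = 2)

print(groups)
```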
group() takes in either a data frame or vector and returns a data frame with a grouping factor added to it. The data frame is grouped by the grouping factor (using dplyr::group_by), which makes it very easy to use in dplyr pipelines.
If, for instance, you have a column in a data frame with quarterly measurements, and you would like to see the average measurement per year, you can simply create groups with a size of 4, and take the mean of each group, all within a 3-line pipeline.
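A rough sketch of such a pipeline is shown below. The data frame and its column names are invented for illustration, and the sketch assumes the rows are already ordered by time, so that every 4 consecutive rows belong to the same year.

```r
library(groupdata2)
library(dplyr)

# Hypothetical data: 3 years of quarterly measurements, ordered by time
df <- data.frame(
  quarter = rep(1:4, times = 3),
  measurement = 1:12
)

# Create groups of 4 rows each (greedily) and average within each group
yearly <- df %>%
  group(n = 4, method = "greedy") %>%
  summarise(mean_measurement = mean(measurement))

print(yearly)
```

The grouping factor is added as its own column, so the summary has one row per group of 4 quarters.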
splt() takes in either a data frame or vector, creates a grouping factor, and splits the given data by this factor using base::split. Often it will be faster to use group() instead of splt(). I also find it easier to work with the output of group().
partition() creates (optionally) balanced partitions (e.g. train/test sets) from given group sizes. It can balance partitions on one categorical variable and/or one numeric variable, and/or keep all data points with a shared ID in the same partition.
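As a small sketch of how this might look, the example below splits a made-up dataset into two partitions, balanced on a categorical column and with each participant kept in a single partition. The data, the column names, and the choice to treat the first partition as the test set are assumptions for illustration.

```r
library(groupdata2)

set.seed(1)

# Hypothetical data: 8 participants with 3 rows each, 4 participants per diagnosis
df <- data.frame(
  participant = factor(rep(1:8, each = 3)),
  diagnosis = factor(rep(c("a", "b"), each = 12)),
  score = rnorm(24)
)

# Put roughly 25% of the data in the first partition,
# balanced on diagnosis, with each participant in only one partition
parts <- partition(df, p = 0.25, cat_col = "diagnosis", id_col = "participant")

test_set <- parts[[1]]
train_set <- parts[[2]]
```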
fold() creates (optionally) balanced folds for cross-validation. It can balance folds on one categorical variable and/or one numeric variable, and/or keep all data points with a shared ID in the same fold. It can also create multiple unique fold columns at once, e.g. for repeated cross-validation with cvms.
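A minimal sketch of creating folds is given below, again with an invented dataset and column names. It assumes the fold identifiers are added in a column named .folds.

```r
library(groupdata2)

set.seed(1)

# Hypothetical data: 6 participants with 3 rows each, 3 participants per diagnosis
df <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = factor(rep(c("a", "b"), each = 9)),
  score = rnorm(18)
)

# Create 3 folds, balanced on diagnosis,
# with all rows from a participant kept in the same fold
folded <- fold(df, k = 3, cat_col = "diagnosis", id_col = "participant")

# Count how many distinct folds each participant appears in (should be 1)
folds_per_participant <- tapply(
  as.character(folded[[".folds"]]),
  as.character(folded[["participant"]]),
  function(x) length(unique(x))
)
```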
balance() uses up- and/or downsampling to fix the group size to the median group size or to a specific number of rows. It has a range of methods for balancing on the ID level.
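To illustrate, the sketch below balances a made-up dataset with three unevenly sized classes to the median class size. The data and column names are assumptions, and no ID column is used here.

```r
library(groupdata2)

set.seed(1)

# Hypothetical data: classes with 2, 5, and 8 rows
df <- data.frame(
  diagnosis = factor(c(rep("a", 2), rep("b", 5), rep("c", 8))),
  x = 1:15
)

# Upsample the small class and downsample the large class
# so that every class ends up with the median number of rows
balanced <- balance(df, size = "median", cat_col = "diagnosis")

class_counts <- table(balanced$diagnosis)
print(class_counts)
```

Since the sampling is random, setting a seed keeps the result reproducible.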
Vignettes / Tutorials
Cross-validation with groupdata2
In this vignette, we go through the basics of cross-validation, such as creating balanced train/test sets with partition() and balanced folds with fold(). We also write up a simple cross-validation function and compare multiple linear regression models.
Time series with groupdata2
In this vignette, we divide up a time series into groups (windows) and subgroups using group() with the ‘greedy’ and ‘staircase’ methods. We do some basic descriptive stats of each group and use them to reduce the data size.
Automatic groups with groupdata2
In this vignette, we will use the ‘l_starts’ method with group() to allow transferring of information from one dataset to another. We will use the automatic grouping function that finds group starts all by itself.
For a more extensive description of the features in groupdata2, see Description of groupdata2.