For those new to groupdata2, it’s a collection of tools for creating groups from your data (hence the name!). From basic greedy groups, to balanced folds for cross-validation, to automatically finding group starts (whenever an element in a vector differs from the previous element), it has a lot to offer. The output is a grouping factor of integers that assigns each element (or each row, in a data frame) to its group, e.g. 1,1,1,2,2,2,3,3,3. In most use cases the input and output are data frames, and the grouping factor is simply added as a column to the output data frame.
A short list of the available use cases:
- Group by specified number of groups.
- Group by group size (greedily), i.e. grab n elements from the beginning. Group size can be passed as a whole number or as a proportion of the data (0 < n < 1). This can be used to change the resolution of temporal data.
- Group by a list of group sizes. Group sizes can be passed as whole numbers or as proportions of the data (0 < n < 1).
- Group by a list of group starts (values in a vector), or find group starts automatically.
- Create (optionally) balanced folds for cross-validation.
- Create (optionally) balanced partitions, e.g. for train/test sets.
The main functions included are group_factor(), group(), splt(), partition(), and fold().
group_factor() is at the heart of it all. It creates the groups and is used by the other functions. It returns a grouping factor with group numbers, i.e. 1s for all elements in group 1, 2s for group 2, etc. So if you ask it to create 2 groups from the vector ("Hans", "Dorte", "Mikkel", "Leif"), it will return the factor (1,1,2,2).
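As a minimal sketch of the behaviour described above, here is group_factor() on a small vector and a small data frame. The argument names (n, method, starts_col) and the method names "greedy" and "l_starts" reflect the package interface as I understand it, and the toy data is made up, so check the documentation for your installed version.

```r
library(groupdata2)

# Two groups from four names; per the text this yields 1, 1, 2, 2.
names_vec <- c("Hans", "Dorte", "Mikkel", "Leif")
two_groups <- group_factor(names_vec, n = 2)

# Made-up data frame with 12 rows (illustrative only).
df <- data.frame(
  session = rep(c("a", "b", "a", "c"), times = c(3, 4, 2, 3)),
  value = 1:12,
  stringsAsFactors = FALSE
)

# Greedy groups of size 5: the last group gets the remaining 2 rows.
greedy_groups <- group_factor(df, n = 5, method = "greedy")

# Automatic group starts: a new group begins whenever `session`
# differs from the previous row.
auto_groups <- group_factor(
  df,
  n = "auto",
  method = "l_starts",
  starts_col = "session"
)
```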
group() takes in either a data frame or vector and returns a data frame with a grouping factor added to it. The data frame is grouped by the grouping factor (using dplyr::group_by), which makes it very easy to use in dplyr pipelines.
If, for instance, you have a column in a data frame with quarterly measurements, and you would like to see the average measurement per year, you can simply create groups with a size of 4, and take the mean of each group, all within a 3-line pipeline.
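The pipeline described above might look like the sketch below. The data frame and its column names (quarter, measurement) are made up for illustration, and the arguments n and method = "greedy" are used as I understand the package interface.

```r
library(groupdata2)
library(dplyr)

# Made-up quarterly measurements for three years (illustrative only).
df <- data.frame(
  quarter = rep(1:4, times = 3),
  measurement = c(10, 12, 14, 16, 20, 22, 24, 26, 30, 32, 34, 36)
)

# Greedy groups of 4 rows each correspond to years. group() returns the
# data grouped by the new factor, so summarise() works per group.
yearly <- df %>%
  group(n = 4, method = "greedy") %>%
  summarise(mean_measurement = mean(measurement))
```

This relies on the rows being ordered by time, since greedy grouping simply takes the first 4 rows, then the next 4, and so on.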
splt() takes in either a data frame or vector, creates a grouping factor, and splits the given data by this factor using base::split. Often it will be faster to use group() instead of splt(). I also find it easier to work with the output of group().
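A small sketch of splt(), assuming it accepts the same n and method arguments as group(); the data frame is made up for illustration.

```r
library(groupdata2)

# Made-up data frame with 12 rows (illustrative only).
df <- data.frame(x = 1:12, y = letters[1:12])

# Three groups of four rows each; the result is a list of data frames,
# one per group.
pieces <- splt(df, n = 3, method = "n_dist")

# Each list element can then be processed separately, e.g. with lapply().
rows_per_piece <- lapply(pieces, nrow)
```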
partition() creates (optionally) balanced partitions (e.g. train/test sets) from given group sizes. It can balance partitions on one categorical variable, keep all data points with a shared ID in the same partition, or both.
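As a hedged sketch, a train/test split balanced on a category and respecting IDs could look like this. The argument names p, cat_col and id_col are used as I understand the package interface, and the data frame (participant, diagnosis, score) is invented for the example. I store the ID column as a factor, which to my knowledge the function expects.

```r
library(groupdata2)

set.seed(1)  # partitions are assigned randomly

# Made-up data: 10 participants, 3 rows each, 5 participants per diagnosis.
df <- data.frame(
  participant = factor(rep(1:10, each = 3)),
  diagnosis = factor(rep(c("a", "b"), each = 15)),
  score = rnorm(30)
)

# About 20% of the data goes to the first partition, balanced on
# diagnosis, with all rows of a participant kept together.
parts <- partition(df, p = 0.2, cat_col = "diagnosis", id_col = "participant")

test_set <- parts[[1]]
train_set <- parts[[2]]
```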
fold() creates (optionally) balanced folds for cross-validation. It can balance folds on one categorical variable, keep all data points with a shared ID in the same fold, or both.
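A comparable sketch for fold(), again with invented data. The argument names k, cat_col and id_col, and the name of the added column (.folds), are given as I understand the package interface.

```r
library(groupdata2)

set.seed(1)  # folds are assigned randomly

# Made-up data: 10 participants, 3 rows each, 5 participants per diagnosis.
df <- data.frame(
  participant = factor(rep(1:10, each = 3)),
  diagnosis = factor(rep(c("a", "b"), each = 15)),
  score = rnorm(30)
)

# Three folds, balanced on diagnosis, with all rows of a participant
# kept in the same fold. A fold column is added to the data frame.
folded <- fold(df, k = 3, cat_col = "diagnosis", id_col = "participant")

# In a cross-validation loop, one fold is held out at a time:
held_out <- folded[folded$.folds == 1, ]
training <- folded[folded$.folds != 1, ]
```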
Vignettes / Tutorials
Cross-validation with groupdata2
In this vignette, we go through the basics of cross-validation, such as creating balanced train/test sets with partition() and balanced folds with fold(). We also write up a simple cross-validation function and compare multiple linear regression models.
Time series with groupdata2
In this vignette, we divide up a time series into groups (windows) and subgroups using group() with the ‘greedy’ and ‘staircase’ methods. We do some basic descriptive stats of each group and use them to reduce the data size.
Automatic groups with groupdata2
In this vignette, we use the ‘l_starts’ method with group() to transfer information from one dataset to another. We use the automatic grouping option, which finds the group starts all by itself.
For a more extensive description of the features in groupdata2, see Description of groupdata2.