Ludvig Renbo Olsen | Portfolio

groupdata2 v1.0.0 release

After having spent a lot of time developing groupdata2 in the spring, adding a lot of new features, I’ve finally found the time to submit it to CRAN.

For those new to groupdata2, it’s a collection of tools for creating groups from your data (hence the name!). From basic greedy groups, to balanced folds for cross-validation, to automatically finding group starts (whenever an element in a vector differs from the previous element), it has a lot to offer. The output is a grouping factor with integers that associates each element (or each row, for a data frame) with its group, e.g. 1,1,1,2,2,2,3,3,3. In most use cases the input and output will be data frames, where the grouping factor is simply added to the output data frame.

A short list of the available use cases:

  • Group by specified number of groups.
  • Group by group size (greedily), i.e. grab n elements from the beginning. The group size can be passed as a whole number or a percentage (0<n<1).
    This can be used to change the resolution of temporal data.
  • Group by a list of group sizes. Group sizes can be passed as whole numbers or percentages (0<n<1).
  • Group by a list of group starts (values in the vector), or automatically find group starts.
  • Create (optionally) balanced folds for cross-validation.
  • Create (optionally) balanced partitions, e.g. for train/test sets.
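
To make the idea of a grouping factor concrete, here is a minimal base-R sketch of two of the use cases above: greedy grouping by size and automatic group starts. It is only an illustration of the concepts and not how groupdata2 implements them; the helper names greedy_groups() and auto_start_groups() are made up for this example.

```r
# Base-R illustration only (hypothetical helpers, not groupdata2 functions).

# Greedy grouping: fill groups of `size` elements from the start of the data.
greedy_groups <- function(x, size) {
  factor(ceiling(seq_along(x) / size))
}

# Automatic group starts: a new group begins whenever an element
# differs from the previous element.
auto_start_groups <- function(x) {
  changed <- c(TRUE, x[-1] != x[-length(x)])
  factor(cumsum(changed))
}

greedy_result <- greedy_groups(1:7, size = 3)
auto_result <- auto_start_groups(c("a", "a", "b", "b", "a", "c"))

greedy_result
auto_result
```

In the greedy case the last group simply gets whatever elements are left over, which is why it can be smaller than the others.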

Functions

The main functions included are group_factor(), group(), splt(), partition(), and fold().

group_factor() is at the heart of it all. It creates the groups and is used by the other functions. It returns a grouping factor with group numbers, i.e. 1s for all elements in group 1, 2s for group 2, etc. So if you ask it to create 2 groups from a vector (‘Hans’, ‘Dorte’, ‘Mikkel’, ‘Leif’), it will return a factor (1,1,2,2).
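
As a small sketch of the example just described, assuming groupdata2 is installed and the default method is used (the variable names are only for illustration):

```r
library(groupdata2)

names_vec <- c("Hans", "Dorte", "Mikkel", "Leif")

# Ask for 2 groups from the 4 elements.
groups <- group_factor(names_vec, n = 2)

# Print the grouping factor: one group number per element.
groups
```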

group() takes in either a data frame or vector and returns a data frame with a grouping factor added to it. The data frame is grouped by the grouping factor (using dplyr::group_by), which makes it very easy to use in dplyr pipelines.
If, for instance, you have a column in a data frame with quarterly measurements, and you would like to see the average measurement per year, you can simply create groups with a size of 4, and take the mean of each group, all within a 3-line pipeline.
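
A sketch of what such a pipeline could look like, assuming groupdata2 and dplyr are installed. The data frame and its column names (quarter, measurement) are made up for this example, and the quarters are assumed to be ordered so that every 4 consecutive rows belong to one year.

```r
library(groupdata2)
library(dplyr)

# Hypothetical data: 3 years of quarterly measurements (12 rows).
df <- data.frame(
  quarter = rep(1:4, times = 3),
  measurement = c(10, 12, 14, 16, 20, 22, 24, 26, 30, 32, 34, 36)
)

# Greedily create groups of 4 rows (one per year), then average each group.
yearly_means <- df %>%
  group(n = 4, method = "greedy") %>%
  summarise(mean_measurement = mean(measurement))

yearly_means
```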

splt() takes in either a data frame or vector, creates a grouping factor, and splits the given data by this factor using base::split. Often it will be faster to use group() instead of splt(). I also find it easier to work with the output of group().

partition() creates (optionally) balanced partitions (e.g. train/test sets) from given group sizes. It can balance partitions on one categorical variable and/or is able to keep all data points with a shared ID in the same partition.

fold() creates (optionally) balanced folds for cross-validation. It can balance folds on one categorical variable and/or is able to keep all data points with a shared ID in the same fold.
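
A minimal sketch of balancing on a categorical variable while keeping IDs together, assuming groupdata2 is installed. The data frame and the column names (participant, diagnosis, score) are hypothetical; the fold assignments are added in a column named .folds.

```r
library(groupdata2)

set.seed(1)  # fold assignment is random

# Hypothetical data: 6 participants, 3 per diagnosis, 2 rows each.
df <- data.frame(
  participant = factor(rep(1:6, each = 2)),
  diagnosis = factor(rep(c("a", "b"), each = 6)),
  score = c(10, 12, 34, 30, 25, 27, 44, 41, 18, 20, 33, 35)
)

# 3 folds, balanced on diagnosis, with all rows of a participant in one fold.
folded <- fold(df, k = 3, cat_col = "diagnosis", id_col = "participant")

# Rows per participant and fold: each participant should appear in one fold only.
table(folded$participant, folded$.folds)
```

partition() takes the same cat_col and id_col arguments, with the partition sizes passed in p instead of a number of folds.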

Vignettes / Tutorials

Cross-validation with groupdata2
In this vignette, we go through the basics of cross-validation, such as creating balanced train/test sets with partition() and balanced folds with fold(). We also write up a simple cross-validation function and compare multiple linear regression models.

Time series with groupdata2
In this vignette, we divide up a time series into groups (windows) and subgroups using group() with the ‘greedy’ and ‘staircase’ methods. We do some basic descriptive stats of each group and use them to reduce the data size.

Automatic groups with groupdata2
In this vignette, we will use the ‘l_starts’ method with group() to transfer information from one dataset to another. We will also use the automatic grouping function, which finds group starts by itself.

For a more extensive description of the features in groupdata2, see Description of groupdata2.

Installation

groupdata2 can be downloaded and installed from CRAN or from GitHub like this:

CRAN version:

install.packages("groupdata2")

Development version:

install.packages("devtools")

devtools::install_github("LudvigOlsen/groupdata2")

splitChunk – RStudio addin for splitting code chunks in R Markdown

When working with R Markdown, I usually use the key command cmd+alt+i to insert new code chunks, i.e. ```{r}\n\n```. Often I do multiple things in one chunk and then want to split the chunk in two and write some text in between.

To do this I have created an addin for RStudio that inserts ```\n\n```{r}. I have set this up with the key command cmd+alt+shift+i as it is kind of a “shifted” version of inserting a new chunk.

The cursor is positioned in-between the chunks allowing me to write an introduction to the following chunk or similar.

Install

The package splitChunk can be installed from GitHub. Paste the following code into the console and run it:

install.packages("devtools")

devtools::install_github("LudvigOlsen/splitChunk")

It is also available through the addinslist package’s Browse RStudio Addins addin.

Add Key Command

After installing it, add a key command (e.g. Mac: cmd-alt-shift-i, Windows: ctrl-alt-shift-i):

  • Go to Tools > Addins > Browse Addins > Keyboard Shortcuts.
  • Find Split Code Chunk and press its field under Shortcut.
  • Press the desired key command.
  • Press Apply.
  • Press Execute.
  • Press the chosen key command inside a code chunk in R Markdown.

Go To: Browse Addins

Add shortcut

Untoggle inline output in RMarkdown in RStudio

A lot of my classmates found themselves frustrated when updating to the newer versions of RStudio, because the output started showing inline, below the code, instead of in the console or Plots window as usual. Personally, I like being able to see the output and the code at the same time, so I found out how to change back. It was pretty easy:

  • Go to RStudio >> Preferences >> R Markdown.
  • Untoggle the option “Show output inline for all R Markdown documents”.

RMarkdown: turn off inline output

  • Restart RStudio and everything should work as before.

There are of course advantages to keeping the output inline as well. Being able to scroll through the R Markdown file and see not only the code but also the output is nice. I usually just knit to HTML for this, but that can take quite some time if the script is long or does a lot of plotting. Being able to switch between inline output and the old behaviour is also useful.