Wednesday, May 29, 2019

Pipes and connections between functions in R

A recent addition to R is the pipe-forwarding mechanism (%>%) within the magrittr package. This is extremely useful when using the dplyr, ggvis, and tidyr packages, among others. Pipe forwarding is an alternative to nesting that yields code that can be read from top to bottom. Here we demonstrate an example that compares traditional (nested) dplyr function calls to the new pipe operator.
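To make the idea concrete before turning to dplyr, here is a minimal sketch on a small numeric vector. The vector x and the object names nested and piped are made up for illustration. The same three functions are applied once in nested form and once with the pipe.

```r
library(magrittr)

# Hypothetical input vector, chosen only for illustration
x <- c(1, 4, 9, 16)

# Nested form: has to be read from the innermost call outwards
nested <- round(mean(sqrt(x)), 1)

# Piped form: each step receives the previous result as its first argument
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both forms compute the same value
identical(nested, piped)
```

In general, x %>% f(y) is evaluated as f(x, y), which is why the piped version can be read in the order the operations happen.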





> library(dplyr)
> ds = read.csv("helpmiss.csv")
> summarise(group_by(select(filter(mutate(ds,
+         sex=ifelse(female==1, "F", "M")), !is.na(pcs)), age, pcs, sex),
+         sex), meanage=mean(age), meanpcs=mean(pcs),n=n())
# A tibble: 2 × 4
    sex  meanage  meanpcs     n
  <chr>    <dbl>    <dbl> <int>
1     F 36.07207 44.86292   111
2     M 35.63025 49.07435   357

In this example, the output of the mutate() function is specified as the input to the filter() function, which drops observations that are missing the pcs variable. The output of filter() is sent to the select() function to create a subset of variables, and the result is passed to the group_by() function, which groups the dataset by sex. The summarise() function then calculates the average age and PCS (physical component score) as well as the sample size for each group. This nested code is very difficult for humans to parse, because it has to be read from the inside out. An alternative would be to save the intermediate results from the nested functions.

> ds2 = mutate(ds, sex=ifelse(female==1, "F", "M"))
> ds3 = filter(ds2, !is.na(pcs))
> ds4 = select(ds3, age, pcs, sex)
> ds5 = group_by(ds4, sex)
> summarise(ds5, meanage=mean(age), meanpcs=mean(pcs),n=n())
# A tibble: 2 × 4
    sex  meanage  meanpcs     n
  <chr>    <dbl>    <dbl> <int>
1     F 36.07207 44.86292   111
2     M 35.63025 49.07435   357

A disadvantage of this (somewhat clunky) approach is that it creates several single-use intermediate objects (ds2 through ds5) and can involve unnecessary copying. This may be particularly inefficient when processing large datasets. The same operations can be expressed in a different (and likely more readable) manner using the %>% operator.

> ds %>%
+     mutate(sex=ifelse(female==1, "F", "M")) %>%
+     filter(!is.na(pcs)) %>%
+     select(age, pcs, sex) %>%
+     group_by(sex) %>%
+     summarise(meanage=mean(age), meanpcs=mean(pcs),n=n())
# A tibble: 2 × 4
    sex  meanage  meanpcs     n
  <chr>    <dbl>    <dbl> <int>
1     F 36.07207 44.86292   111
2     M 35.63025 49.07435   357

Here, it is clear what each operation within the "pipe stream" is doing. It is also straightforward to debug expressions written this way: ending the pipeline at a given line (by leaving off that line's trailing %>% and everything after it) evaluates only the functions called up to that point and displays the intermediate output.
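The debugging approach can be sketched as follows. Because helpmiss.csv may not be at hand, this example assumes the built-in mtcars data as a stand-in, with a made-up trans variable playing the role of sex. The object names full and partial are likewise only illustrative.

```r
library(dplyr)

# Full pipeline on mtcars (a stand-in for the HELP dataset used above)
full <- mtcars %>%
  mutate(trans = ifelse(am == 1, "manual", "automatic")) %>%
  filter(!is.na(mpg)) %>%
  select(mpg, hp, trans) %>%
  group_by(trans) %>%
  summarise(meanmpg = mean(mpg), meanhp = mean(hp), n = n())

# Same pipeline cut off after select(): the trailing %>% and the later
# steps are left out, so the intermediate data frame is returned instead
partial <- mtcars %>%
  mutate(trans = ifelse(am == 1, "manual", "automatic")) %>%
  filter(!is.na(mpg)) %>%
  select(mpg, hp, trans)

# Inspect the intermediate result before it is grouped and summarised
head(partial)
```

Comparing partial with full shows which step introduces a problem, without having to create and name a separate object for every stage.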


Reference

Horton, N.J. and Kleinman, K., 2015. Using R and RStudio for Data Management, Statistical Analysis, and Graphics. Chapman and Hall/CRC.
