Data Manipulation in R with dplyr - Summarize and the pipe operator

Section 7 - Last but not least: summarize

The syntax of summarize

summarize(), the last of the five verbs, follows the same syntax as mutate(). The difference lies in the result: mutate() adds a new column to the dataset, whereas summarize() returns a dataset that consists of a single row.

In contrast to the four other data manipulation functions, summarize() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.
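To make the contrast concrete, here is a minimal sketch on a small made-up tibble (the name toy and its columns are hypothetical and not part of hflights). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(id = 1:4, dist = c(100, 250, 400, 50))

# mutate() keeps every row and adds a column
with_double <- mutate(toy, dist_x2 = dist * 2)

# summarize() builds a new one-row dataset holding only the statistics
stats <- summarize(toy, min_dist = min(dist), max_dist = max(dist))

dim(with_double)  # 4 rows, 3 columns
dim(stats)        # 1 row, 2 columns
```

The original columns id and dist do not appear in stats; only the named summaries do.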

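library(dplyr)     # summarize(), filter(), %>% and as_tibble()
library(hflights)

# Random sample of 720 flights (not used in the examples below)
hflights_df <- hflights[sample(nrow(hflights), 720), ]
hflights <- as_tibble(hflights)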

# Lookup table that maps carrier codes to full carrier names
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")

# Replace the carrier codes in UniqueCarrier with the full names
hflights$UniqueCarrier <- lut[hflights$UniqueCarrier]

# Print out a summary with variables min_dist and max_dist
summarize(hflights, min_dist = min(Distance), max_dist = max(Distance))

## # A tibble: 1 x 2
##   min_dist max_dist
##      <dbl>    <dbl>
## 1       79     3904

# Print out a summary with variable max_div
filter(hflights, Diverted == 1) %>%
   summarize(max_div = max(Distance))

## # A tibble: 1 x 1
##   max_div
##     <dbl>
## 1    3904

Aggregate functions

You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:

min(x) - minimum value of vector x.
max(x) - maximum value of vector x.
mean(x) - mean value of vector x.
median(x) - median value of vector x.
quantile(x, p) - pth quantile of vector x.
sd(x) - standard deviation of vector x.
var(x) - variance of vector x.
IQR(x) - interquartile range (IQR) of vector x.
diff(range(x)) - total range of vector x.
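As a quick check of the list above, the following sketch applies each function inside a single summarize() call. The tibble toy and its values are made up for illustration; names = FALSE is passed to quantile() only to drop the "90%" label from its result.

```r
library(dplyr)

# Made-up values, for illustration only
toy <- tibble(x = c(2, 4, 4, 5, 10))

stats <- summarize(toy,
                   lowest      = min(x),
                   highest     = max(x),
                   average     = mean(x),
                   middle      = median(x),
                   q90         = quantile(x, 0.9, names = FALSE),
                   spread      = sd(x),
                   variance    = var(x),
                   iqr         = IQR(x),
                   total_range = diff(range(x)))
stats
```

Each function reduces the vector x to one value, so the result is a single row with one column per statistic.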

# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarize(temp1,
          earliest = min(ArrDelay),
          average = mean(ArrDelay),
          latest = max(ArrDelay),
          sd = sd(ArrDelay))

## # A tibble: 1 x 4
##   earliest average latest    sd
##      <dbl>   <dbl>  <dbl> <dbl>
## 1      -70    7.09    978  30.7

# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarize()
summarize(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))

## # A tibble: 1 x 1
##   max_taxi_diff
##           <int>
## 1           160

dplyr aggregate functions

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) - The first element of vector x.
last(x) - The last element of vector x.
nth(x, n) - The nth element of vector x.
n() - The number of rows in the data.frame or group of observations that summarize() describes.
n_distinct(x) - The number of unique values in vector x.

In addition to these dplyr-specific functions, you can turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUE and FALSE values. When you apply sum() or mean() to such a vector, R coerces each TRUE to 1 and each FALSE to 0. sum() then gives the total number of observations that passed the test; mean() gives the proportion.
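The coercion can be seen on a plain vector. This is a small sketch with made-up delay values; the names delays and late are hypothetical.

```r
# Made-up arrival delays in minutes, including one missing value
delays <- c(-5, 0, 12, 30, NA)

# Logical test: which flights were more than 10 minutes late?
late <- delays > 10        # FALSE FALSE TRUE TRUE NA

n_late <- sum(late, na.rm = TRUE)    # number of TRUE values
p_late <- mean(late, na.rm = TRUE)   # share of TRUE among non-missing values

n_late
p_late
```

Without na.rm = TRUE, the single NA would make both results NA, which is why the hflights examples either filter out missing values first or pass na.rm = TRUE.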

# Generate summarizing statistics for hflights
summarize(hflights,
          n_obs = n(),
          n_carrier = n_distinct(UniqueCarrier),
          n_dest = n_distinct(Dest))

## # A tibble: 1 x 3
##    n_obs n_carrier n_dest
##    <int>     <int>  <int>
## 1 227496        15    116
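The call above uses n() and n_distinct(). The positional helpers first(), nth() and last() work the same way, as this sketch on a made-up tibble shows (toy and its columns are hypothetical):

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(carrier = c("AA", "AA", "DL", "UA"),
              delay   = c(5, -3, 12, 0))

res <- summarize(toy,
                 first_delay  = first(delay),    # element 1
                 second_delay = nth(delay, 2),   # element 2
                 last_delay   = last(delay),     # final element
                 n_rows       = n(),
                 n_carriers   = n_distinct(carrier))
res
```

The positional helpers depend on row order, so their results change if the data is sorted differently beforehand.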

# All American Airline flights
aa <- filter(hflights, UniqueCarrier == "American")

# Generate summarizing statistics for aa 
summarize(aa,
          n_flights = n(),
          n_canc = sum(Cancelled == 1),
          avg_delay = mean(ArrDelay, na.rm = TRUE))

## # A tibble: 1 x 3
##   n_flights n_canc avg_delay
##       <int>  <int>     <dbl>
## 1      3244     60     0.892

The summaries above answer questions such as how many American Airlines flights were cancelled (60) and how many unique carriers are listed in hflights (15). You might have noticed that saving intermediate results to temporary variables or nesting function calls is cumbersome and error-prone.

Section 8 - Chaining your functions: the pipe operator

Overview of syntax

As another example of the %>% operator, have a look at the following two commands, which are completely equivalent:

mean(c(1, 2, 3, NA), na.rm = TRUE)
c(1, 2, 3, NA) %>% mean(na.rm = TRUE)

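The %>% operator takes the value on its left and passes it as the first argument of the function on its right. This solves the Dagwood sandwich problem: in deeply nested calls, a function's name and its remaining arguments end up far apart, with the inner calls sandwiched between them.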
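The sketch below writes the same three-step computation twice, once nested and once piped, on a made-up tibble (toy, taxi_in and taxi_out are hypothetical names). In the nested form, avg = mean(diff) sits far from the summarize() it belongs to; in the piped form, each argument stays next to its verb.

```r
library(dplyr)

# Made-up taxi times in minutes, for illustration only
toy <- tibble(taxi_in  = c(5, 7, NA, 10),
              taxi_out = c(15, 12, 20, NA))

# Nested version: read from the inside out
nested <- summarize(filter(mutate(toy, diff = taxi_out - taxi_in),
                           !is.na(diff)),
                    avg = mean(diff))

# Piped version: read from top to bottom
piped <- toy %>%
  mutate(diff = taxi_out - taxi_in) %>%
  filter(!is.na(diff)) %>%
  summarize(avg = mean(diff))

nested$avg
piped$avg
```

Both versions return the same one-row result; only the readability differs.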

# Piped version: compute the taxiing difference, drop missing values, then average
hflights %>%
    mutate(diff = (TaxiOut - TaxiIn)) %>%
    filter(!is.na(diff)) %>%
    summarize(avg = mean(diff))

## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1  8.99

Drive or fly? Part 1 of 2

You can answer sophisticated questions by combining the verbs of dplyr. Over the next few exercises you will examine whether it sometimes makes sense to drive instead of fly. You will begin by making a data set that contains relevant variables. Then, you will find flights whose equivalent average velocity is lower than the velocity when traveling by car.

In the following instructions, you have to carry out a series of dplyr verbs on the hflights dataset. Make sure to use the %>% operator to chain them all together.

# Chain together mutate(), filter() and summarize()
# RealTime: actual elapsed time plus 100 minutes (for the overhead that flying involves)
# mph: 60 times Distance divided by RealTime
hflights %>%
    mutate(RealTime = ActualElapsedTime + 100, mph = (60 * Distance) / RealTime) %>%
    filter(!is.na(mph), mph < 70) %>%
    summarize(n_less = n(),
              n_dest = n_distinct(Dest),
              min_dist = min(Distance),
              max_dist = max(Distance))

## # A tibble: 1 x 4
##   n_less n_dest min_dist max_dist
##    <int>  <int>    <dbl>    <dbl>
## 1   6726     13       79      305

Try to interpret these results. For example, figure out how many destinations were flown to at a speed lower than 70 mph.
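To make the mph column concrete, here is the same arithmetic for two made-up flights. The helper real_mph() is hypothetical and only restates the formula from the mutate() call above: distance in miles, times in minutes, and a fixed overhead of 100 minutes.

```r
# Hypothetical helper restating the formula used in the pipeline
real_mph <- function(distance, elapsed, overhead = 100) {
  real_time <- elapsed + overhead   # minutes, door to door
  60 * distance / real_time         # miles per hour
}

# 224 miles in 60 minutes of flight: 60 * 224 / 160
fast <- real_mph(224, 60)

# 140 miles in 50 minutes of flight: 60 * 140 / 150
slow <- real_mph(140, 50)

fast
slow
slow < 70   # would be kept by filter(mph < 70)
```

The fixed overhead weighs most heavily on short flights, which is consistent with the small max_dist in the summary above.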

Drive or fly? Part 2 of 2

The previous exercise suggested that some flights might be less efficient than driving in terms of speed. But is speed all that matters? Flying imposes burdens on a traveler that driving does not. For example, airplane tickets are very expensive. Air travelers also need to limit what they bring on their trip and arrange for a pick up or a drop off. Given these burdens we might demand that a flight provide a large speed advantage over driving.

Let’s define preferable flights as flights that are at least 50% faster than driving, i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or diverted flights are less preferable than driving.

# Finish the command with a filter() and summarize() call
hflights %>%
  mutate(
    RealTime = ActualElapsedTime + 100, 
    mph = 60 * Distance / RealTime
  ) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarize(n_non = n(),
            p_non = n_non / nrow(hflights) * 100,
            n_dest = n_distinct(Dest),
            min_dist = min(Distance),
            max_dist = max(Distance))

## # A tibble: 1 x 5
##   n_non p_non n_dest min_dist max_dist
##   <int> <dbl>  <int>    <dbl>    <dbl>
## 1 42400  18.6    113       79     3904

The results show that almost 19% of flights appear less desirable than simply driving to the destination, which is rather surprising!
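Two details of the filter() call are worth a sketch. The 105 mph threshold is 1.5 times the 70 mph driving speed, and the | conditions interact with missing values: NA | TRUE is TRUE, so a flight with a missing mph is still kept when it is flagged as cancelled or diverted. The tibble toy below is made up, and it assumes that a cancelled flight has no elapsed time and therefore a missing mph.

```r
library(dplyr)

threshold <- 70 * 1.5   # 50% faster than driving at 70 mph

# Made-up flights: slow, fast, cancelled (mph missing), missing but not flagged
toy <- tibble(mph       = c(90, 120, NA, NA),
              Cancelled = c(0, 0, 1, 0),
              Diverted  = c(0, 0, 0, 0))

kept <- filter(toy, mph < threshold | Cancelled == 1 | Diverted == 1)

threshold
kept
```

The slow flight and the cancelled flight are kept. The fast flight fails every condition, and the last row is dropped because its condition evaluates to NA, which filter() treats like FALSE.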

Advanced piping exercise

Let’s use hflights to answer another question: How many flights were overnight flights?

# Count the number of overnight flights
hflights %>%
      filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
      summarize(num = n())

## # A tibble: 1 x 1
##     num
##   <int>
## 1  2718

Indeed, 2718 flights! It’s official, you are a master of pipes!
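The comparison DepTime > ArrTime works because hflights stores both times as integers in hhmm format, so a departure at 2315 and an arrival at 0045 (stored as 45) marks a flight that crossed midnight. A minimal sketch on made-up times (the tibble toy is hypothetical):

```r
library(dplyr)

# Made-up departure and arrival times in hhmm format
toy <- tibble(DepTime = c(2315,  900,   NA, 2359),
              ArrTime = c(  45, 1130, 1200,  110))

overnight <- toy %>%
  filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
  summarize(num = n())

overnight$num
```

Because filter() already drops rows whose condition is NA, the two is.na() checks make the intent explicit rather than change the count.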

Session info

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] hflights_0.1     ggplot2_3.1.0    dplyr_0.8.0.1    gapminder_0.3.0 
## [5] kableExtra_1.0.1 knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        plyr_1.8.4        pillar_1.3.1     
##  [4] compiler_3.5.2    prettydoc_0.2.1   tools_3.5.2      
##  [7] digest_0.6.18     gtable_0.2.0      evaluate_0.12    
## [10] tibble_2.0.1      viridisLite_0.3.0 pkgconfig_2.0.2  
## [13] rlang_0.3.1       cli_1.0.1         rstudioapi_0.9.0 
## [16] yaml_2.2.0        xfun_0.4          withr_2.1.2      
## [19] httr_1.4.0        stringr_1.4.0     xml2_1.2.0       
## [22] hms_0.4.2         webshot_0.5.1     grid_3.5.2       
## [25] tidyselect_0.2.5  glue_1.3.0        R6_2.4.0         
## [28] fansi_0.4.0       rmarkdown_1.11    readr_1.3.1      
## [31] purrr_0.3.0       magrittr_1.5      scales_1.0.0     
## [34] htmltools_0.3.6   assertthat_0.2.0  rvest_0.3.2      
## [37] colorspace_1.4-0  utf8_1.1.4        stringi_1.3.1    
## [40] lazyeval_0.2.1    munsell_0.5.0     crayon_1.3.4