Whats Covered

Aditional Resources


Advanced joining


What can go wrong?

  • This chapter is about becoming a master at joining data
  • The main problems with joins come from missing or duplicate key values or columns
  • NA in the key column?
    • Probably just need to remove these rows
  • Missing key column?
    • it might be in the row name. use rownames_to_column in tibble package
  • duplicate keys?
    • Might just need to include another column in the key criteria
    • If you really have duplicates be careful as you will get every combination of key value matches across datasets.

Spot the key

R’s data frames can store important information in the row.names attribute. This is not a tidy way to store data, but it does happen quite commonly. If the primary key of your dataset is stored in row.names, you will have trouble joining it to other datasets.

For example, stage_songs contains information about songs that appear in musicals. However, it stores the primary key (song name) in the row.names attribute. As a result, you cannot access the key with a join function.

One way to remedy this problem is to use the function rownames_to_column() from the tibble package. rownames_to_column() returns a copy of a dataset with the row names added to the data as a column.

# Whoops, the primary key is in the row names for stage_songs
# We need to fix that before joining to stage_writers
stage_songs
##                                     musical year
## Children Will Listen         Into the Woods 1986
## Maria                       West Side Story 1957
## Memory                                 Cats 1981
## The Music of the Night Phantom of the Opera 1986
stage_writers
##                     song            composer
## 1   Children Will Listen    Stephen Sondheim
## 2                  Maria   Leonard Bernstein
## 3                 Memory Andrew Lloyd Webber
## 4 The Music of the Night Andrew Lloyd Webber
# Load the tibble package
library(tibble)

stage_songs %>% 
  # Add row names as a column named song
  rownames_to_column("song") %>% 
  # Left join stage_writers to stage_songs
  left_join(stage_writers, by = c("song")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song musical year composer
Children Will Listen Into the Woods 1986 Stephen Sondheim
Maria West Side Story 1957 Leonard Bernstein
Memory Cats 1981 Andrew Lloyd Webber
The Music of the Night Phantom of the Opera 1986 Andrew Lloyd Webber

The tibble package contains many functions that began in dplyr and have since been moved to tibble

Non-unique keys

shows and composers provide some more information about songs that appear in musicals.

You can join the two datasets by the musical column, which appears in both the datasets. However, you will run into a glitch: the values of musical do not uniquely identify the rows of composers. For example, two rows have the value “The Sound of Music” for musical and two other rows have the value “The King and I”.

How many entries (rows) will dplyr create for The Sound of Music if you left join composers to shows by musical?

  • Non unique keys can be okay if you want the info for each entry and don’t mind it being duplicated.
    • like years will get duplicated here because there are non unique keys in the musical column of composers dataset
shows
## # A tibble: 6 x 2
##   musical               year
##   <chr>                <int>
## 1 The Sound of Music    1959
## 2 Into the Woods        1986
## 3 The King and I        1951
## 4 West Side Story       1957
## 5 Cats                  1981
## 6 Phantom of the Opera  1986
# There are more than one composer for some musicals, so the musical column has duplicate key values. 
composers
## # A tibble: 8 x 2
##   musical              composer            
##   <chr>                <chr>               
## 1 Cats                 Andrew Lloyd Webber 
## 2 Into the Woods       Stephen Sondheim    
## 3 Phantom of the Opera Andrew Lloyd Webber 
## 4 The King and I       Richard Rogers      
## 5 The King and I       Oscar Hammerstein II
## 6 The Sound of Music   Richard Rogers      
## 7 The Sound of Music   Oscar Hammerstein II
## 8 West Side Story      Leonard Bernstein
# join by musical
shows %>% left_join(composers, by = "musical") %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
musical year composer
The Sound of Music 1959 Richard Rogers
The Sound of Music 1959 Oscar Hammerstein II
Into the Woods 1986 Stephen Sondheim
The King and I 1951 Richard Rogers
The King and I 1951 Oscar Hammerstein II
West Side Story 1957 Leonard Bernstein
Cats 1981 Andrew Lloyd Webber
Phantom of the Opera 1986 Andrew Lloyd Webber

If a row in the primary dataset contains multiple matches in the secondary dataset, left_join() will duplicate the row once for every match.

Two non-unique keys

You saw in the last exercise that if a row in the primary dataset contains multiple matches in the secondary dataset, left_join() will duplicate the row once for every match. This is true for all of dplyr’s join functions.

Now, let’s see how this rule would apply when the primary dataset contains duplicate key values.

show_songs contains songs that appear in the musicals written by the composers. You can join the two by the musical column, but like composers, show_songs has two rows where musical == "The Sound of Music".

How many entries (rows) will exist for The Sound of Music if you left join composers to show_songs by musical?

show_songs
## # A tibble: 8 x 3
##   song                        musical               year
##   <chr>                       <chr>                <int>
## 1 A Few of My Favorite Things The Sound of Music    1959
## 2 Children Will Listen        Into the Woods        1986
## 3 Edelweiss                   The Sound of Music    1959
## 4 Getting to Know You         The King and I        1951
## 5 Maria                       West Side Story       1957
## 6 Memory                      Cats                  1981
## 7 Shall We Dance?             The King and I        1951
## 8 The Music of the Night      Phantom of the Opera  1986
show_songs %>% 
  left_join(composers, by = "musical") %>%
  arrange(musical) %>% 
  head(12) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song musical year composer
Memory Cats 1981 Andrew Lloyd Webber
Children Will Listen Into the Woods 1986 Stephen Sondheim
The Music of the Night Phantom of the Opera 1986 Andrew Lloyd Webber
Getting to Know You The King and I 1951 Richard Rogers
Getting to Know You The King and I 1951 Oscar Hammerstein II
Shall We Dance? The King and I 1951 Richard Rogers
Shall We Dance? The King and I 1951 Oscar Hammerstein II
A Few of My Favorite Things The Sound of Music 1959 Richard Rogers
A Few of My Favorite Things The Sound of Music 1959 Oscar Hammerstein II
Edelweiss The Sound of Music 1959 Richard Rogers
Edelweiss The Sound of Music 1959 Oscar Hammerstein II
Maria West Side Story 1957 Leonard Bernstein

This time left_join() duplicates each row in show_songs that contains the Sound of Music twice, once for each row of composers that contains the Sound of Music. The result is four rows that contain the Sound of Music.

Missing keys

Julie Andrews is a famous singer who performed in The Sound of Music, which is documented in the singers data frame. The two_songs data frame records a couple of songs that she sang during her career.

Unfortunately, when you join the datasets together, something odd happens. According to the result, who sings A Spoonful of Sugar? Why does this happen?

You can avoid this outcome by removing rows that contain NAs in the keys before joining.

  • Missing keys can mistakenly join two rows that do not belong toegether
singers
## # A tibble: 2 x 2
##   movie              singer               
##   <chr>              <chr>                
## 1 <NA>               Arnold Schwarzenegger
## 2 The Sound of Music Julie Andrews
two_songs
## # A tibble: 2 x 2
##   song                movie             
##   <chr>               <chr>             
## 1 Do-Re-Mi            The Sound of Music
## 2 A Spoonful of Sugar <NA>
# Examine the result of joining singers to two_songs
two_songs %>% inner_join(singers, by = "movie") %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song movie singer
Do-Re-Mi The Sound of Music Julie Andrews
A Spoonful of Sugar NA Arnold Schwarzenegger
# Remove NA's from key before joining
two_songs %>% 
  filter(!is.na(movie)) %>% 
  inner_join(singers, by = "movie") %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song movie singer
Do-Re-Mi The Sound of Music Julie Andrews

This is the type of detail that dplyr is likely to handle itself in the future.

Defining the keys

  • If you do not set a by argument dplyr will join on all columns that match across the two datasets
  • You can use named vetors to match column with different names by = c("name1"" = "name2")
  • If there are non-key columns with the same names they wil show up as name.x and name.y
    • You can change this by adding the suffix argument. suffix = c("name1","name2")

Which keys?

Frank Sinatra and Bing Crosby each crossed the boundary from music to acting, singing famous songs in well-known movies. movie_years and movie_songs describe some of their work.
Examine movie_years and movie_songs to determine which keys dplyr would use to join the datasets if you did not set the by argument. In this case, it would not matter which type of join you use, or in which order you call the datasets.

movie_years
## # A tibble: 10 x 3
##    movie                   name           year
##    <chr>                   <chr>         <int>
##  1 The Road to Morocco     Bing Crosby    1942
##  2 Going My Way            Bing Crosby    1944
##  3 Anchors Aweigh          Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5 White Christmas         Bing Crosby    1954
##  6 The Tender Trap         Frank Sinatra  1955
##  7 High Society            Bing Crosby    1956
##  8 The Joker is Wild       Frank Sinatra  1957
##  9 Pal Joey                Frank Sinatra  1957
## 10 Can-Can                 Frank Sinatra  1960
movie_songs
## # A tibble: 11 x 3
##    song                      movie                   name         
##    <chr>                     <chr>                   <chr>        
##  1 Moonlight Becomes You     The Road to Morocco     Bing Crosby  
##  2 Swinging on a Star        Going My Way            Bing Crosby  
##  3 Too-Ra-Loo-Ra-Loo-Ral     Going My Way            Bing Crosby  
##  4 I Fall in Love Too Easily Anchors Aweigh          Frank Sinatra
##  5 Ol' Man River             Till the Clouds Roll By Frank Sinatra
##  6 White Christmas           White Christmas         Bing Crosby  
##  7 (Love Is) The Tender Trap The Tender Trap         Frank Sinatra
##  8 Well, Did you Evah?       High Society            Bing Crosby  
##  9 All the Way               The Joker is Wild       Frank Sinatra
## 10 The Lady is a Tramp       Pal Joey                Frank Sinatra
## 11 I Love Paris              Can-Can                 Frank Sinatra
movie_years %>% 
  # Inner join movie_years to movie_songs
  inner_join(movie_songs) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
## Joining, by = c("movie", "name")
movie name year song
The Road to Morocco Bing Crosby 1942 Moonlight Becomes You
Going My Way Bing Crosby 1944 Swinging on a Star
Going My Way Bing Crosby 1944 Too-Ra-Loo-Ra-Loo-Ral
Anchors Aweigh Frank Sinatra 1945 I Fall in Love Too Easily
Till the Clouds Roll By Frank Sinatra 1946 Ol’ Man River
White Christmas Bing Crosby 1954 White Christmas
The Tender Trap Frank Sinatra 1955 (Love Is) The Tender Trap
High Society Bing Crosby 1956 Well, Did you Evah?
The Joker is Wild Frank Sinatra 1957 All the Way
Pal Joey Frank Sinatra 1957 The Lady is a Tramp
Can-Can Frank Sinatra 1960 I Love Paris

By default, dplyr will join by the combination of all column names that appear in both datasets.

A subset of keys

Often the same column name will be used by two datasets to refer to different things. For example, the data frame movie_studios uses name to refer to the name of a movie studio. movie_years uses name to refer to the name of an actor.

You could join these datasets (they describe the same movies), but you wouldn’t want to use the name column to do so!

dplyr will ignore duplicate column names if you set the by argument and do not include the duplicated name in the argument. When you do this, dplyr will treat the columns in the normal fashion, but it will add .x and .y to the duplicated names to help you tell the columns apart.

movie_years %>% 
  # Left join movie_studios to movie_years
  left_join(movie_studios, by = c("movie")) %>% 
  # Rename the columns: artist and studio
  rename(artist = name.x, studio = name.y) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
movie artist year studio
The Road to Morocco Bing Crosby 1942 Paramount Pictures
Going My Way Bing Crosby 1944 Paramount Pictures
Anchors Aweigh Frank Sinatra 1945 Metro-Goldwyn-Mayer
Till the Clouds Roll By Frank Sinatra 1946 Metro-Goldwyn-Mayer
White Christmas Bing Crosby 1954 Paramount Pictures
The Tender Trap Frank Sinatra 1955 Metro-Goldwyn-Mayer
High Society Bing Crosby 1956 Metro-Goldwyn-Mayer
The Joker is Wild Frank Sinatra 1957 Paramount Pictures
Pal Joey Frank Sinatra 1957 Columbia Pictures
Can-Can Frank Sinatra 1960 Twentieth-Century Fox

Mis-matched key names

Just as the same name can refer to different things in different datasets, different names can refer to the same thing. For example, elvis_movies and elvis_songs both describe movies starring Elvis Presley, but each uses a different column name to describe the name of the movie.

This type of inconsistency can be frustrating when you wish to join data based on the inconsistently named variable.

To make the join, set by to a named vector. The names of the vector will refer to column names in the primary dataset (x). The values of the vector will correspond to the column names in the secondary dataset (y), e.g.

x %>% left_join(y, by = c(“x.name1” = “y.name2”))

dplyr will make the join and retain the names in the primary dataset.

# Identify the key column
elvis_songs
## # A tibble: 5 x 2
##   name                                 movie         
##   <chr>                                <chr>         
## 1 (You're So Square) Baby I Don't Care Jailhouse Rock
## 2 I Can't Help Falling in Love         Blue Hawaii   
## 3 Jailhouse Rock                       Jailhouse Rock
## 4 Viva Las Vegas                       Viva Las Vegas
## 5 You Don't Know Me                    Clambake
elvis_movies
## # A tibble: 4 x 2
##   name            year
##   <chr>          <int>
## 1 Jailhouse Rock  1957
## 2 Blue Hawaii     1961
## 3 Viva Las Vegas  1963
## 4 Clambake        1967
elvis_movies %>% 
  # Left join elvis_songs to elvis_movies by this column
  left_join(elvis_songs, by = c("name" = "movie")) %>% 
  # Rename columns
  rename(movie = name, song = name.y) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
movie year song
Jailhouse Rock 1957 (You’re So Square) Baby I Don’t Care
Jailhouse Rock 1957 Jailhouse Rock
Blue Hawaii 1961 I Can’t Help Falling in Love
Viva Las Vegas 1963 Viva Las Vegas
Clambake 1967 You Don’t Know Me

More mismatched names

movie_years and movie_directors both describe movies that feature Frank Sinatra and Bing Crosby. However, each data frame uses different column names to refer to the key variables that connect the datasets.

Can you join the data?

# Identify the key columns
movie_directors
## # A tibble: 10 x 3
##    name                    director        studio               
##    <chr>                   <chr>           <chr>                
##  1 Anchors Aweigh          George Sidney   Metro-Goldwyn-Mayer  
##  2 Can-Can                 Walter Lang     Twentieth-Century Fox
##  3 Going My Way            Leo McCarey     Paramount Pictures   
##  4 High Society            Charles Walters Metro-Goldwyn-Mayer  
##  5 Pal Joey                George Sidney   Columbia Pictures    
##  6 The Joker is Wild       Charles Vidor   Paramount Pictures   
##  7 The Road to Morocco     David Butler    Paramount Pictures   
##  8 The Tender Trap         Charles Walters Metro-Goldwyn-Mayer  
##  9 Till the Clouds Roll By Richard Whorf   Metro-Goldwyn-Mayer  
## 10 White Christmas         Michael Curtiz  Paramount Pictures
movie_years
## # A tibble: 10 x 3
##    movie                   name           year
##    <chr>                   <chr>         <int>
##  1 The Road to Morocco     Bing Crosby    1942
##  2 Going My Way            Bing Crosby    1944
##  3 Anchors Aweigh          Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5 White Christmas         Bing Crosby    1954
##  6 The Tender Trap         Frank Sinatra  1955
##  7 High Society            Bing Crosby    1956
##  8 The Joker is Wild       Frank Sinatra  1957
##  9 Pal Joey                Frank Sinatra  1957
## 10 Can-Can                 Frank Sinatra  1960
movie_years %>% 
  # Left join movie_directors to movie_years
  left_join(movie_directors, by = c("movie" = "name")) %>% 
  # Arrange the columns using select()
  select(year, movie, artist = name, director, studio) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year movie artist director studio
1942 The Road to Morocco Bing Crosby David Butler Paramount Pictures
1944 Going My Way Bing Crosby Leo McCarey Paramount Pictures
1945 Anchors Aweigh Frank Sinatra George Sidney Metro-Goldwyn-Mayer
1946 Till the Clouds Roll By Frank Sinatra Richard Whorf Metro-Goldwyn-Mayer
1954 White Christmas Bing Crosby Michael Curtiz Paramount Pictures
1955 The Tender Trap Frank Sinatra Charles Walters Metro-Goldwyn-Mayer
1956 High Society Bing Crosby Charles Walters Metro-Goldwyn-Mayer
1957 The Joker is Wild Frank Sinatra Charles Vidor Paramount Pictures
1957 Pal Joey Frank Sinatra George Sidney Columbia Pictures
1960 Can-Can Frank Sinatra Walter Lang Twentieth-Century Fox

Joining multiple tables

  • To join multiple tables we could do a bunch of left_joins in a row
  • The purrr package has a function reduce() that lets us apply a function recursively to a list of datasets
  • purrr is also written by hadley wickham so it will work well with dplyr functions
  • the purpose of purrr is to help us apply R functions to data in efficient ways

Join multiple tables

  • Using reduce to do a left_join quick of all tables

purrr’s reduce() function is very useful for joining together multiple datasets. Let’s practice using it with three new datasets.

supergroups lists the members of several “supergroup” bands, which are bands made from the members of other famous bands. more_bands lists the original band membership of many musicians, including those in supergroups. And more_artists lists the instruments that various artists played.

Can you join these datasets into a single, coherent dataset?

# Load the purrr library
library(purrr)

# Place supergroups, more_bands, and more_artists into a list
list(supergroups, more_bands, more_artists) %>% 
  # Use reduce to join together the contents of the list
  reduce(left_join, by = c("first", "last")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
supergroup first last band instrument
Cream Ginger Baker Graham Bond Organisation drums
Cream Jack Bruce Graham Bond Organisation bass
Cream Jack Bruce Graham Bond Organisation bass
Cream Jack Bruce Manfred Mann bass
Cream Jack Bruce Manfred Mann bass
Cream Eric Clapton The Yardbirds guitar
Crosby, Stills, Nash, and Young David Crosby The Byrds guitar
Crosby, Stills, Nash, and Young Graham Nash The Hollies guitar
Crosby, Stills, Nash, and Young Stephen Stills Buffalo Springfield guitar
Crosby, Stills, Nash, and Young Neil Young Buffalo Springfield guitar
The Highwaymen Johnny Cash NA guitar
The Highwaymen Waylon Jennings NA guitar
The Highwaymen Kris Kristofferson NA guitar
The Highwaymen Willie Nelson NA guitar
The Traveling Wilburys Bob Dylan NA guitar
The Traveling Wilburys George Harrison The Beatles guitar
The Traveling Wilburys Jeff Lynne Electric Light Orchestra vocals
The Traveling Wilburys Roy Orbison NA guitar
The Traveling Wilburys Tom Petty Tom Petty and the Heartbreakers vocals

Filter multiple tables

Nice work! The job of reduce() is to apply a function in an iterative fashion to many datasets. As a result, reduce() works well with all of the dplyr join functions.

For example, you can use reduce() to filter observations with a filtering join.

list(more_artists, more_bands, supergroups) %>% 
  # Return rows of more_artists in all three datasets
  reduce(semi_join, by = c("first", "last")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last instrument
David Crosby guitar
Eric Clapton guitar
George Harrison guitar
Ginger Baker drums
Graham Nash guitar
Jack Bruce bass
Jack Bruce bass
Jeff Lynne vocals
Neil Young guitar
Stephen Stills guitar
Tom Petty vocals

The reduce idiom is a powerful tool and well worth knowing. In fact, it is a major component of ‘MapReduce’ in the Hadoop world.

Other implementations

  • We can do the same joins as we have been using here wih the merge function, but the syntax is more confusing
    • I will always just use dplyr joins
  • The dplyr joins translate to sql joins as show
    • These are good to know as often I am exploring a sql database before pulling data into R. It happens.
  • These are the DBMSs (Database management systems) you can easily connect to from R
    • Once you have connected you can write dplyr code and it will be translated to sql and run in the DBMS
    • This is incredible useful. Its worthy of its own corse.
    • checkout vignette("dbplyr") to see more about dplyr’s database capabilities
    • We can use collect to pull the data into R when we are ready or data.frame will also pull it in.

SQL

  • Each dplyr join has an SQL equivalent, and when you apply a dplyr join to a SQL data source (instead of a data frame), dplyr automatically converts your join to its SQL equivalent and runs it in the database.

Base R

dplyr joins are also similar to the merge() function in base R.

For example, the code in the editor uses merge() and arrange() to perform a left_join() on bands and artists.

# Alter the code to perform the join with a dplyr function
merge(bands, artists, by = c("first", "last"), all.x = TRUE) %>%
  arrange(band) 
##        first      last               band instrument
## 1      Jimmy      Page       Led Zeppelin     Guitar
## 2       John    Bonham       Led Zeppelin       <NA>
## 3  John Paul     Jones       Led Zeppelin       <NA>
## 4     Robert     Plant       Led Zeppelin       <NA>
## 5     George  Harrison        The Beatles     Guitar
## 6       John    Lennon        The Beatles     Guitar
## 7       Paul McCartney        The Beatles       Bass
## 8      Ringo     Starr        The Beatles      Drums
## 9      Jimmy   Buffett  The Coral Reefers     Guitar
## 10   Charlie     Watts The Rolling Stones       <NA>
## 11     Keith  Richards The Rolling Stones     Guitar
## 12      Mick    Jagger The Rolling Stones     Vocals
## 13    Ronnie      Wood The Rolling Stones       <NA>
bands %>% left_join(artists, by = c("first", "last"))
## # A tibble: 13 x 4
##    first     last      band               instrument
##    <chr>     <chr>     <chr>              <chr>     
##  1 John      Bonham    Led Zeppelin       <NA>      
##  2 John Paul Jones     Led Zeppelin       <NA>      
##  3 Jimmy     Page      Led Zeppelin       Guitar    
##  4 Robert    Plant     Led Zeppelin       <NA>      
##  5 George    Harrison  The Beatles        Guitar    
##  6 John      Lennon    The Beatles        Guitar    
##  7 Paul      McCartney The Beatles        Bass      
##  8 Ringo     Starr     The Beatles        Drums     
##  9 Jimmy     Buffett   The Coral Reefers  Guitar    
## 10 Mick      Jagger    The Rolling Stones Vocals    
## 11 Keith     Richards  The Rolling Stones Guitar    
## 12 Charlie   Watts     The Rolling Stones <NA>      
## 13 Ronnie    Wood      The Rolling Stones <NA>

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] purrr_0.3.0      tibble_2.0.1     ggplot2_3.1.0    dplyr_0.8.0.1   
## [5] gapminder_0.3.0  kableExtra_1.0.1 knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         plyr_1.8.4       
##  [4] pillar_1.3.1      compiler_3.5.2    prettydoc_0.2.1  
##  [7] tools_3.5.2       digest_0.6.18     gtable_0.2.0     
## [10] evaluate_0.12     viridisLite_0.3.0 pkgconfig_2.0.2  
## [13] rlang_0.3.1       cli_1.0.1         rstudioapi_0.9.0 
## [16] yaml_2.2.0        xfun_0.4          withr_2.1.2      
## [19] httr_1.4.0        stringr_1.4.0     xml2_1.2.0       
## [22] hms_0.4.2         webshot_0.5.1     grid_3.5.2       
## [25] tidyselect_0.2.5  glue_1.3.0        R6_2.4.0         
## [28] fansi_0.4.0       rmarkdown_1.11    readr_1.3.1      
## [31] magrittr_1.5      codetools_0.2-15  scales_1.0.0     
## [34] htmltools_0.3.6   assertthat_0.2.0  rvest_0.3.2      
## [37] colorspace_1.4-0  utf8_1.1.4        stringi_1.3.1    
## [40] lazyeval_0.2.1    munsell_0.5.0     crayon_1.3.4