Whats Covered

Mutating joins
Filtering joins and set operations
Assembling data
Advanced joining
Case Study

Aditional Resources

Advanced joining

What can go wrong?

This chapter is about becoming a master at joining data
The main problems with joins come from missing or duplicate key values or columns
NA in the key column?
- Probably just need to remove these rows
Missing key column?
- it might be in the row name. use rownames_to_column in tibble package
duplicate keys?
- Might just need to include another column in the key criteria
- If you really have duplicates be careful as you will get every combination of key value matches across datasets.

Spot the key

R’s data frames can store important information in the row.names attribute. This is not a tidy way to store data, but it does happen quite commonly. If the primary key of your dataset is stored in row.names, you will have trouble joining it to other datasets.

For example, stage_songs contains information about songs that appear in musicals. However, it stores the primary key (song name) in the row.names attribute. As a result, you cannot access the key with a join function.

One way to remedy this problem is to use the function rownames_to_column() from the tibble package. rownames_to_column() returns a copy of a dataset with the row names added to the data as a column.

# Whoops, the primary key is in the row names for stage_songs
# We need to fix that before joining to stage_writers
stage_songs

##                                     musical year
## Children Will Listen         Into the Woods 1986
## Maria                       West Side Story 1957
## Memory                                 Cats 1981
## The Music of the Night Phantom of the Opera 1986

stage_writers

##                     song            composer
## 1   Children Will Listen    Stephen Sondheim
## 2                  Maria   Leonard Bernstein
## 3                 Memory Andrew Lloyd Webber
## 4 The Music of the Night Andrew Lloyd Webber

# Load the tibble package
library(tibble)

stage_songs %>% 
  # Add row names as a column named song
  rownames_to_column("song") %>% 
  # Left join stage_writers to stage_songs
  left_join(stage_writers, by = c("song")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

song	musical	year	composer
Children Will Listen	Into the Woods	1986	Stephen Sondheim
Maria	West Side Story	1957	Leonard Bernstein
Memory	Cats	1981	Andrew Lloyd Webber
The Music of the Night	Phantom of the Opera	1986	Andrew Lloyd Webber

The tibble package contains many functions that began in dplyr and have since been moved to tibble

Non-unique keys

shows and composers provide some more information about songs that appear in musicals.

You can join the two datasets by the musical column, which appears in both the datasets. However, you will run into a glitch: the values of musical do not uniquely identify the rows of composers. For example, two rows have the value “The Sound of Music” for musical and two other rows have the value “The King and I”.

How many entries (rows) will dplyr create for The Sound of Music if you left join composers to shows by musical?

Non unique keys can be okay if you want the info for each entry and don’t mind it being duplicated.
- like years will get duplicated here because there are non unique keys in the musical column of composers dataset

shows

## # A tibble: 6 x 2
##   musical               year
##   <chr>                <int>
## 1 The Sound of Music    1959
## 2 Into the Woods        1986
## 3 The King and I        1951
## 4 West Side Story       1957
## 5 Cats                  1981
## 6 Phantom of the Opera  1986

# There are more than one composer for some musicals, so the musical column has duplicate key values. 
composers

## # A tibble: 8 x 2
##   musical              composer            
##   <chr>                <chr>               
## 1 Cats                 Andrew Lloyd Webber 
## 2 Into the Woods       Stephen Sondheim    
## 3 Phantom of the Opera Andrew Lloyd Webber 
## 4 The King and I       Richard Rogers      
## 5 The King and I       Oscar Hammerstein II
## 6 The Sound of Music   Richard Rogers      
## 7 The Sound of Music   Oscar Hammerstein II
## 8 West Side Story      Leonard Bernstein

# join by musical
shows %>% left_join(composers, by = "musical") %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

musical	year	composer
The Sound of Music	1959	Richard Rogers
The Sound of Music	1959	Oscar Hammerstein II
Into the Woods	1986	Stephen Sondheim
The King and I	1951	Richard Rogers
The King and I	1951	Oscar Hammerstein II
West Side Story	1957	Leonard Bernstein
Cats	1981	Andrew Lloyd Webber
Phantom of the Opera	1986	Andrew Lloyd Webber

If a row in the primary dataset contains multiple matches in the secondary dataset, left_join() will duplicate the row once for every match.

Two non-unique keys

You saw in the last exercise that if a row in the primary dataset contains multiple matches in the secondary dataset, left_join() will duplicate the row once for every match. This is true for all of dplyr’s join functions.

Now, let’s see how this rule would apply when the primary dataset contains duplicate key values.

show_songs contains songs that appear in the musicals written by the composers. You can join the two by the musical column, but like composers, show_songs has two rows where musical == "The Sound of Music".

How many entries (rows) will exist for The Sound of Music if you left join composers to show_songs by musical?

show_songs

## # A tibble: 8 x 3
##   song                        musical               year
##   <chr>                       <chr>                <int>
## 1 A Few of My Favorite Things The Sound of Music    1959
## 2 Children Will Listen        Into the Woods        1986
## 3 Edelweiss                   The Sound of Music    1959
## 4 Getting to Know You         The King and I        1951
## 5 Maria                       West Side Story       1957
## 6 Memory                      Cats                  1981
## 7 Shall We Dance?             The King and I        1951
## 8 The Music of the Night      Phantom of the Opera  1986

show_songs %>% 
  left_join(composers, by = "musical") %>%
  arrange(musical) %>% 
  head(12) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

song	musical	year	composer
Memory	Cats	1981	Andrew Lloyd Webber
Children Will Listen	Into the Woods	1986	Stephen Sondheim
The Music of the Night	Phantom of the Opera	1986	Andrew Lloyd Webber
Getting to Know You	The King and I	1951	Richard Rogers
Getting to Know You	The King and I	1951	Oscar Hammerstein II
Shall We Dance?	The King and I	1951	Richard Rogers
Shall We Dance?	The King and I	1951	Oscar Hammerstein II
A Few of My Favorite Things	The Sound of Music	1959	Richard Rogers
A Few of My Favorite Things	The Sound of Music	1959	Oscar Hammerstein II
Edelweiss	The Sound of Music	1959	Richard Rogers
Edelweiss	The Sound of Music	1959	Oscar Hammerstein II
Maria	West Side Story	1957	Leonard Bernstein

This time left_join() duplicates each row in show_songs that contains the Sound of Music twice, once for each row of composers that contains the Sound of Music. The result is four rows that contain the Sound of Music.

Missing keys

Julie Andrews is a famous singer who performed in The Sound of Music, which is documented in the singers data frame. The two_songs data frame records a couple of songs that she sang during her career.

Unfortunately, when you join the datasets together, something odd happens. According to the result, who sings A Spoonful of Sugar? Why does this happen?

You can avoid this outcome by removing rows that contain NAs in the keys before joining.

Missing keys can mistakenly join two rows that do not belong toegether

singers

## # A tibble: 2 x 2
##   movie              singer               
##   <chr>              <chr>                
## 1 <NA>               Arnold Schwarzenegger
## 2 The Sound of Music Julie Andrews

two_songs

## # A tibble: 2 x 2
##   song                movie             
##   <chr>               <chr>             
## 1 Do-Re-Mi            The Sound of Music
## 2 A Spoonful of Sugar <NA>

# Examine the result of joining singers to two_songs
two_songs %>% inner_join(singers, by = "movie") %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

song	movie	singer
Do-Re-Mi	The Sound of Music	Julie Andrews
A Spoonful of Sugar	NA	Arnold Schwarzenegger

# Remove NA's from key before joining
two_songs %>% 
  filter(!is.na(movie)) %>% 
  inner_join(singers, by = "movie") %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

song	movie	singer
Do-Re-Mi	The Sound of Music	Julie Andrews

This is the type of detail that dplyr is likely to handle itself in the future.

Defining the keys

If you do not set a by argument dplyr will join on all columns that match across the two datasets
You can use named vetors to match column with different names by = c("name1"" = "name2")
If there are non-key columns with the same names they wil show up as name.x and name.y
- You can change this by adding the suffix argument. suffix = c("name1","name2")

Which keys?

Frank Sinatra and Bing Crosby each crossed the boundary from music to acting, singing famous songs in well-known movies. movie_years and movie_songs describe some of their work.
Examine movie_years and movie_songs to determine which keys dplyr would use to join the datasets if you did not set the by argument. In this case, it would not matter which type of join you use, or in which order you call the datasets.

movie_years

## # A tibble: 10 x 3
##    movie                   name           year
##    <chr>                   <chr>         <int>
##  1 The Road to Morocco     Bing Crosby    1942
##  2 Going My Way            Bing Crosby    1944
##  3 Anchors Aweigh          Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5 White Christmas         Bing Crosby    1954
##  6 The Tender Trap         Frank Sinatra  1955
##  7 High Society            Bing Crosby    1956
##  8 The Joker is Wild       Frank Sinatra  1957
##  9 Pal Joey                Frank Sinatra  1957
## 10 Can-Can                 Frank Sinatra  1960

movie_songs

## # A tibble: 11 x 3
##    song                      movie                   name         
##    <chr>                     <chr>                   <chr>        
##  1 Moonlight Becomes You     The Road to Morocco     Bing Crosby  
##  2 Swinging on a Star        Going My Way            Bing Crosby  
##  3 Too-Ra-Loo-Ra-Loo-Ral     Going My Way            Bing Crosby  
##  4 I Fall in Love Too Easily Anchors Aweigh          Frank Sinatra
##  5 Ol' Man River             Till the Clouds Roll By Frank Sinatra
##  6 White Christmas           White Christmas         Bing Crosby  
##  7 (Love Is) The Tender Trap The Tender Trap         Frank Sinatra
##  8 Well, Did you Evah?       High Society            Bing Crosby  
##  9 All the Way               The Joker is Wild       Frank Sinatra
## 10 The Lady is a Tramp       Pal Joey                Frank Sinatra
## 11 I Love Paris              Can-Can                 Frank Sinatra

movie_years %>% 
  # Inner join movie_years to movie_songs
  inner_join(movie_songs) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

## Joining, by = c("movie", "name")

movie	name	year	song
The Road to Morocco	Bing Crosby	1942	Moonlight Becomes You
Going My Way	Bing Crosby	1944	Swinging on a Star
Going My Way	Bing Crosby	1944	Too-Ra-Loo-Ra-Loo-Ral
Anchors Aweigh	Frank Sinatra	1945	I Fall in Love Too Easily
Till the Clouds Roll By	Frank Sinatra	1946	Ol’ Man River
White Christmas	Bing Crosby	1954	White Christmas
The Tender Trap	Frank Sinatra	1955	(Love Is) The Tender Trap
High Society	Bing Crosby	1956	Well, Did you Evah?
The Joker is Wild	Frank Sinatra	1957	All the Way
Pal Joey	Frank Sinatra	1957	The Lady is a Tramp
Can-Can	Frank Sinatra	1960	I Love Paris

By default, dplyr will join by the combination of all column names that appear in both datasets.

A subset of keys

Often the same column name will be used by two datasets to refer to different things. For example, the data frame movie_studios uses name to refer to the name of a movie studio. movie_years uses name to refer to the name of an actor.

You could join these datasets (they describe the same movies), but you wouldn’t want to use the name column to do so!

dplyr will ignore duplicate column names if you set the by argument and do not include the duplicated name in the argument. When you do this, dplyr will treat the columns in the normal fashion, but it will add .x and .y to the duplicated names to help you tell the columns apart.

movie_years %>% 
  # Left join movie_studios to movie_years
  left_join(movie_studios, by = c("movie")) %>% 
  # Rename the columns: artist and studio
  rename(artist = name.x, studio = name.y) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

movie	artist	year	studio
The Road to Morocco	Bing Crosby	1942	Paramount Pictures
Going My Way	Bing Crosby	1944	Paramount Pictures
Anchors Aweigh	Frank Sinatra	1945	Metro-Goldwyn-Mayer
Till the Clouds Roll By	Frank Sinatra	1946	Metro-Goldwyn-Mayer
White Christmas	Bing Crosby	1954	Paramount Pictures
The Tender Trap	Frank Sinatra	1955	Metro-Goldwyn-Mayer
High Society	Bing Crosby	1956	Metro-Goldwyn-Mayer
The Joker is Wild	Frank Sinatra	1957	Paramount Pictures
Pal Joey	Frank Sinatra	1957	Columbia Pictures
Can-Can	Frank Sinatra	1960	Twentieth-Century Fox

Mis-matched key names

Just as the same name can refer to different things in different datasets, different names can refer to the same thing. For example, elvis_movies and elvis_songs both describe movies starring Elvis Presley, but each uses a different column name to describe the name of the movie.

This type of inconsistency can be frustrating when you wish to join data based on the inconsistently named variable.

To make the join, set by to a named vector. The names of the vector will refer to column names in the primary dataset (x). The values of the vector will correspond to the column names in the secondary dataset (y), e.g.

x %>% left_join(y, by = c(“x.name1” = “y.name2”))

dplyr will make the join and retain the names in the primary dataset.

# Identify the key column
elvis_songs

## # A tibble: 5 x 2
##   name                                 movie         
##   <chr>                                <chr>         
## 1 (You're So Square) Baby I Don't Care Jailhouse Rock
## 2 I Can't Help Falling in Love         Blue Hawaii   
## 3 Jailhouse Rock                       Jailhouse Rock
## 4 Viva Las Vegas                       Viva Las Vegas
## 5 You Don't Know Me                    Clambake

elvis_movies

## # A tibble: 4 x 2
##   name            year
##   <chr>          <int>
## 1 Jailhouse Rock  1957
## 2 Blue Hawaii     1961
## 3 Viva Las Vegas  1963
## 4 Clambake        1967

elvis_movies %>% 
  # Left join elvis_songs to elvis_movies by this column
  left_join(elvis_songs, by = c("name" = "movie")) %>% 
  # Rename columns
  rename(movie = name, song = name.y) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

movie	year	song
Jailhouse Rock	1957	(You’re So Square) Baby I Don’t Care
Jailhouse Rock	1957	Jailhouse Rock
Blue Hawaii	1961	I Can’t Help Falling in Love
Viva Las Vegas	1963	Viva Las Vegas
Clambake	1967	You Don’t Know Me

More mismatched names

movie_years and movie_directors both describe movies that feature Frank Sinatra and Bing Crosby. However, each data frame uses different column names to refer to the key variables that connect the datasets.

Can you join the data?

# Identify the key columns
movie_directors

## # A tibble: 10 x 3
##    name                    director        studio               
##    <chr>                   <chr>           <chr>                
##  1 Anchors Aweigh          George Sidney   Metro-Goldwyn-Mayer  
##  2 Can-Can                 Walter Lang     Twentieth-Century Fox
##  3 Going My Way            Leo McCarey     Paramount Pictures   
##  4 High Society            Charles Walters Metro-Goldwyn-Mayer  
##  5 Pal Joey                George Sidney   Columbia Pictures    
##  6 The Joker is Wild       Charles Vidor   Paramount Pictures   
##  7 The Road to Morocco     David Butler    Paramount Pictures   
##  8 The Tender Trap         Charles Walters Metro-Goldwyn-Mayer  
##  9 Till the Clouds Roll By Richard Whorf   Metro-Goldwyn-Mayer  
## 10 White Christmas         Michael Curtiz  Paramount Pictures

movie_years

## # A tibble: 10 x 3
##    movie                   name           year
##    <chr>                   <chr>         <int>
##  1 The Road to Morocco     Bing Crosby    1942
##  2 Going My Way            Bing Crosby    1944
##  3 Anchors Aweigh          Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5 White Christmas         Bing Crosby    1954
##  6 The Tender Trap         Frank Sinatra  1955
##  7 High Society            Bing Crosby    1956
##  8 The Joker is Wild       Frank Sinatra  1957
##  9 Pal Joey                Frank Sinatra  1957
## 10 Can-Can                 Frank Sinatra  1960

movie_years %>% 
  # Left join movie_directors to movie_years
  left_join(movie_directors, by = c("movie" = "name")) %>% 
  # Arrange the columns using select()
  select(year, movie, artist = name, director, studio) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

year	movie	artist	director	studio
1942	The Road to Morocco	Bing Crosby	David Butler	Paramount Pictures
1944	Going My Way	Bing Crosby	Leo McCarey	Paramount Pictures
1945	Anchors Aweigh	Frank Sinatra	George Sidney	Metro-Goldwyn-Mayer
1946	Till the Clouds Roll By	Frank Sinatra	Richard Whorf	Metro-Goldwyn-Mayer
1954	White Christmas	Bing Crosby	Michael Curtiz	Paramount Pictures
1955	The Tender Trap	Frank Sinatra	Charles Walters	Metro-Goldwyn-Mayer
1956	High Society	Bing Crosby	Charles Walters	Metro-Goldwyn-Mayer
1957	The Joker is Wild	Frank Sinatra	Charles Vidor	Paramount Pictures
1957	Pal Joey	Frank Sinatra	George Sidney	Columbia Pictures
1960	Can-Can	Frank Sinatra	Walter Lang	Twentieth-Century Fox

Joining multiple tables

To join multiple tables we could do a bunch of left_joins in a row
The purrr package has a function reduce() that lets us apply a function recursively to a list of datasets
purrr is also written by hadley wickham so it will work well with dplyr functions
the purpose of purrr is to help us apply R functions to data in efficient ways

Join multiple tables

Using reduce to do a left_join quick of all tables

purrr’s reduce() function is very useful for joining together multiple datasets. Let’s practice using it with three new datasets.

supergroups lists the members of several “supergroup” bands, which are bands made from the members of other famous bands. more_bands lists the original band membership of many musicians, including those in supergroups. And more_artists lists the instruments that various artists played.

Can you join these datasets into a single, coherent dataset?

# Load the purrr library
library(purrr)

# Place supergroups, more_bands, and more_artists into a list
list(supergroups, more_bands, more_artists) %>% 
  # Use reduce to join together the contents of the list
  reduce(left_join, by = c("first", "last")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

supergroup	first	last	band	instrument
Cream	Ginger	Baker	Graham Bond Organisation	drums
Cream	Jack	Bruce	Graham Bond Organisation	bass
Cream	Jack	Bruce	Graham Bond Organisation	bass
Cream	Jack	Bruce	Manfred Mann	bass
Cream	Jack	Bruce	Manfred Mann	bass
Cream	Eric	Clapton	The Yardbirds	guitar
Crosby, Stills, Nash, and Young	David	Crosby	The Byrds	guitar
Crosby, Stills, Nash, and Young	Graham	Nash	The Hollies	guitar
Crosby, Stills, Nash, and Young	Stephen	Stills	Buffalo Springfield	guitar
Crosby, Stills, Nash, and Young	Neil	Young	Buffalo Springfield	guitar
The Highwaymen	Johnny	Cash	NA	guitar
The Highwaymen	Waylon	Jennings	NA	guitar
The Highwaymen	Kris	Kristofferson	NA	guitar
The Highwaymen	Willie	Nelson	NA	guitar
The Traveling Wilburys	Bob	Dylan	NA	guitar
The Traveling Wilburys	George	Harrison	The Beatles	guitar
The Traveling Wilburys	Jeff	Lynne	Electric Light Orchestra	vocals
The Traveling Wilburys	Roy	Orbison	NA	guitar
The Traveling Wilburys	Tom	Petty	Tom Petty and the Heartbreakers	vocals

Filter multiple tables

Nice work! The job of reduce() is to apply a function in an iterative fashion to many datasets. As a result, reduce() works well with all of the dplyr join functions.

For example, you can use reduce() to filter observations with a filtering join.

list(more_artists, more_bands, supergroups) %>% 
  # Return rows of more_artists in all three datasets
  reduce(semi_join, by = c("first", "last")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

first	last	instrument
David	Crosby	guitar
Eric	Clapton	guitar
George	Harrison	guitar
Ginger	Baker	drums
Graham	Nash	guitar
Jack	Bruce	bass
Jack	Bruce	bass
Jeff	Lynne	vocals
Neil	Young	guitar
Stephen	Stills	guitar
Tom	Petty	vocals

The reduce idiom is a powerful tool and well worth knowing. In fact, it is a major component of ‘MapReduce’ in the Hadoop world.

Other implementations

We can do the same joins as we have been using here wih the merge function, but the syntax is more confusing
- I will always just use dplyr joins
The dplyr joins translate to sql joins as show
- These are good to know as often I am exploring a sql database before pulling data into R. It happens.
These are the DBMSs (Database management systems) you can easily connect to from R
- Once you have connected you can write dplyr code and it will be translated to sql and run in the DBMS
- This is incredible useful. Its worthy of its own corse.
- checkout vignette("dbplyr") to see more about dplyr’s database capabilities
- We can use collect to pull the data into R when we are ready or data.frame will also pull it in.

SQL

Each dplyr join has an SQL equivalent, and when you apply a dplyr join to a SQL data source (instead of a data frame), dplyr automatically converts your join to its SQL equivalent and runs it in the database.

Base R

dplyr joins are also similar to the merge() function in base R.

For example, the code in the editor uses merge() and arrange() to perform a left_join() on bands and artists.

# Alter the code to perform the join with a dplyr function
merge(bands, artists, by = c("first", "last"), all.x = TRUE) %>%
  arrange(band)

##        first      last               band instrument
## 1      Jimmy      Page       Led Zeppelin     Guitar
## 2       John    Bonham       Led Zeppelin       <NA>
## 3  John Paul     Jones       Led Zeppelin       <NA>
## 4     Robert     Plant       Led Zeppelin       <NA>
## 5     George  Harrison        The Beatles     Guitar
## 6       John    Lennon        The Beatles     Guitar
## 7       Paul McCartney        The Beatles       Bass
## 8      Ringo     Starr        The Beatles      Drums
## 9      Jimmy   Buffett  The Coral Reefers     Guitar
## 10   Charlie     Watts The Rolling Stones       <NA>
## 11     Keith  Richards The Rolling Stones     Guitar
## 12      Mick    Jagger The Rolling Stones     Vocals
## 13    Ronnie      Wood The Rolling Stones       <NA>

bands %>% left_join(artists, by = c("first", "last"))

## # A tibble: 13 x 4
##    first     last      band               instrument
##    <chr>     <chr>     <chr>              <chr>     
##  1 John      Bonham    Led Zeppelin       <NA>      
##  2 John Paul Jones     Led Zeppelin       <NA>      
##  3 Jimmy     Page      Led Zeppelin       Guitar    
##  4 Robert    Plant     Led Zeppelin       <NA>      
##  5 George    Harrison  The Beatles        Guitar    
##  6 John      Lennon    The Beatles        Guitar    
##  7 Paul      McCartney The Beatles        Bass      
##  8 Ringo     Starr     The Beatles        Drums     
##  9 Jimmy     Buffett   The Coral Reefers  Guitar    
## 10 Mick      Jagger    The Rolling Stones Vocals    
## 11 Keith     Richards  The Rolling Stones Guitar    
## 12 Charlie   Watts     The Rolling Stones <NA>      
## 13 Ronnie    Wood      The Rolling Stones <NA>

Session info

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] purrr_0.3.0      tibble_2.0.1     ggplot2_3.1.0    dplyr_0.8.0.1   
## [5] gapminder_0.3.0  kableExtra_1.0.1 knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         plyr_1.8.4       
##  [4] pillar_1.3.1      compiler_3.5.2    prettydoc_0.2.1  
##  [7] tools_3.5.2       digest_0.6.18     gtable_0.2.0     
## [10] evaluate_0.12     viridisLite_0.3.0 pkgconfig_2.0.2  
## [13] rlang_0.3.1       cli_1.0.1         rstudioapi_0.9.0 
## [16] yaml_2.2.0        xfun_0.4          withr_2.1.2      
## [19] httr_1.4.0        stringr_1.4.0     xml2_1.2.0       
## [22] hms_0.4.2         webshot_0.5.1     grid_3.5.2       
## [25] tidyselect_0.2.5  glue_1.3.0        R6_2.4.0         
## [28] fansi_0.4.0       rmarkdown_1.11    readr_1.3.1      
## [31] magrittr_1.5      codetools_0.2-15  scales_1.0.0     
## [34] htmltools_0.3.6   assertthat_0.2.0  rvest_0.3.2      
## [37] colorspace_1.4-0  utf8_1.1.4        stringi_1.3.1    
## [40] lazyeval_0.2.1    munsell_0.5.0     crayon_1.3.4

Joining Data in R with dplyr - Advanced joining

Guillaume Abgrall

2019-05-02